Vikas Prabhu

RAG Explained for SQL Developers: Think of It as SELECT, But for Meaning

If you've spent years writing SELECT statements, tuning execution plans, and building B-Tree indexes, you already understand 80% of what RAG does — you just don't know it yet.

RAG (Retrieval-Augmented Generation) is essentially a retrieval pipeline, just like SQL. The difference? Instead of WHERE category = 'Small Cap', it does WHERE meaning ≈ 'what are the risks'. Instead of returning rows, it returns document passages. And instead of your application rendering a table, an LLM synthesizes a human-readable answer.

This article maps every RAG concept to the SQL equivalent you already know — from storage to indexing to query execution — so it clicks immediately.


The Mental Model

Here's the mapping that'll make everything click:

| You Know This (SQL) | Now Learn This (RAG) |
|---|---|
| `CREATE TABLE` + `INSERT` rows | Chunk documents + store embeddings |
| `CREATE INDEX` (B-Tree) | Vector index (HNSW) |
| `WHERE column = value` | Cosine similarity search |
| ROWID → fetch row from data block | Chunk ID → fetch text passage |
| Execution plan (CBO optimizer) | Embed query → top-K nearest neighbors |
| ResultSet → render table in app | Retrieved chunks → LLM generates answer |

Now let's walk through each phase.


Phase 1: Data Sources — What You're Storing

🟡 SQL — You know this

You work with structured, schema-defined data. Every row has the same columns, every column has a defined type:

  • NAV tables — `fund_id`, `nav_date`, `nav_value` (NUMBER)
  • Portfolio holdings — `security_id`, `weight_pct`, `market_value`
  • Transaction ledger — `txn_id`, `amount`, `txn_type`, `timestamp`
  • Client master — `client_id`, `name`, `PAN`, `risk_profile`
  • Fund performance — `fund_id`, `period`, `return_pct`, `benchmark_pct`

Clean, typed, relational. This is your comfort zone.

🔵 RAG — Now imagine this

But what about all the data that doesn't fit into rows and columns?

  • SEBI circulars & regulatory filings — PDF documents with legal text
  • Fund factsheets & offer documents — Multi-page PDFs with tables mixed into prose
  • Research reports & analyst notes — Word docs, emails, presentations
  • Board meeting minutes — Free-form narrative text
  • Annual reports & MD&A sections — Long-form financial commentary

You can't SELECT * FROM sebi_circular WHERE topic = 'liquidity risk' — because there's no table, no columns, no schema. This is exactly the problem RAG solves. It makes unstructured documents queryable, just like SQL makes structured tables queryable.


Phase 2: Data Ingestion & Storage

🟡 SQL — You know this

You define a schema, create tables, and insert rows. Each row gets a physical address (ROWID) in Oracle's data blocks.

-- Step 1: Define schema (DDL)
CREATE TABLE funds (
    fund_id      NUMBER        PRIMARY KEY,
    fund_name    VARCHAR2(200) NOT NULL,
    category     VARCHAR2(50),
    risk_rating  VARCHAR2(20),
    aum_crores   NUMBER(15,2),
    inception_dt DATE
) TABLESPACE fund_data;

-- Step 2: Insert structured rows
INSERT INTO funds VALUES (
    1001,
    'Axis Small Cap Fund',
    'Small Cap',
    'Very High',
    18542.30,
    DATE '2013-11-05'
);

-- Step 3: Data stored in Oracle data blocks
-- Each row → ROWID (physical address)
-- e.g., AAABnPAAEAAAAJHAAA
-- Format: Object | File | Block | Row

Summary: Schema → typed rows → physical storage in data blocks → each row addressable by ROWID.

🔵 RAG — Now here's the equivalent

RAG does the same thing, but for documents. Instead of rows, you store chunks. Instead of ROWIDs, you store vectors.

# Step 1: Load the document (like reading a CSV into a staging table)
doc = load_pdf("SEBI_circular_2024.pdf")

# Step 2: Split into chunks (like breaking one big row into normalized pieces)
chunks = text_splitter(
    doc,
    chunk_size=512,       # ~500 tokens per chunk
    chunk_overlap=50      # overlap so context isn't lost at boundaries
)
# Result: ["Small-cap funds carry higher...",
#          "SEBI mandates stress testing...",
#          "Liquidity risk management..." ]

# Step 3: Convert each chunk into a vector — THIS IS THE KEY STEP
# Think of it as: computing a "semantic ROWID" for each chunk
vectors = embedding_model.encode(chunks)
# Each chunk → [0.023, -0.187, 0.441, ...] (1536 floating-point numbers)

# Step 4: Store in vector database (like INSERT INTO ... VALUES)
vector_db.upsert(
    ids=["chunk_001", "chunk_002", ...],     # like primary keys
    embeddings=vectors,                       # like the "semantic ROWID"
    metadata=[{"source": "SEBI_2024.pdf", "page": 3}, ...]  # like extra columns
)
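The `chunk_size` / `chunk_overlap` step above can be sketched in a few lines of pure Python. This is a character-based toy — real splitters (LangChain's text splitters, for example) also respect sentence and paragraph boundaries:

```python
# Toy sliding-window chunker — a sketch of what text_splitter does.
# Each window starts (chunk_size - overlap) characters after the
# previous one, so text at a boundary lands in BOTH neighboring chunks.

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Demo with a 100-character string of repeating digits:
demo = "".join(str(i % 10) for i in range(100))
chunks = chunk_text(demo, chunk_size=40, overlap=10)
# The last 10 characters of chunks[0] reappear as the first 10
# characters of chunks[1] — that's the overlap preserving context.
```

The overlap is the part SQL has no analogue for: rows are independent, but sentences at a chunk boundary need their neighbors to stay meaningful.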

The SQL developer's translation:

| SQL Concept | RAG Equivalent |
|---|---|
| `CREATE TABLE` with typed columns | Define chunk size + embedding model |
| `INSERT INTO` a row | `upsert()` a chunk + its vector |
| ROWID (physical disk address) | Vector embedding (semantic address in meaning-space) |
| Data block in tablespace | Vector stored in Pinecone / ChromaDB / pgvector |
| Extra columns (category, date) | Metadata fields (source file, page number) |

Key insight: A ROWID tells Oracle where a row lives on disk. A vector embedding tells the vector DB what a chunk means. Both are addresses — one physical, one semantic.
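That "semantic address" comparison is just cosine similarity — the angle between two vectors. A minimal sketch with toy 3-d vectors standing in for real 1536-d embeddings (the vectors and their "meanings" here are made up for illustration):

```python
# Cosine similarity: 1.0 = same direction in meaning-space,
# near 0.0 = unrelated. This is the "≈" in WHERE meaning ≈ ...
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

risk       = [0.9, 0.1, 0.2]   # pretend embedding of "risk"
volatility = [0.8, 0.2, 0.3]   # pretend embedding of "volatility"
nav_date   = [0.1, 0.9, 0.1]   # pretend embedding of "nav_date"

cosine_similarity(risk, volatility)  # high  — related concepts
cosine_similarity(risk, nav_date)    # low   — unrelated concepts
```

A real embedding model is what makes "risk" and "volatility" land close together; the similarity math itself is this simple.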


Phase 3: Indexing — Making It Searchable

Without indexes, both systems do full scans. You know this pain. Let's see how each solves it.

🟡 SQL — You know this

-- B-Tree Index (your bread and butter)
CREATE INDEX idx_funds_category
ON funds(category)
TABLESPACE fund_idx;

-- Composite Index (multi-column)
CREATE INDEX idx_perf_fund_date
ON fund_performance(fund_id, period_type, period);

-- Bitmap Index (for low-cardinality columns)
CREATE BITMAP INDEX idx_funds_risk
ON funds(risk_rating);
-- Great for: 'Very High', 'High', 'Moderate' (few distinct values)

-- Function-Based Index
CREATE INDEX idx_funds_upper_name
ON funds(UPPER(fund_name));
-- Speeds up: WHERE UPPER(fund_name) LIKE 'AXIS%'

How B-Tree works: Root → branch → leaf → ROWID → fetch row. For 1 million rows, it takes ~3-4 hops. Deterministic, exact, fast.

🔵 RAG — Now here's the equivalent

Instead of a B-Tree that organizes values in sorted order, RAG uses a vector index that organizes embeddings by semantic proximity.

# HNSW Index (the "B-Tree of the vector world")
# — Most popular, best recall/speed tradeoff
index = create_index(
    type="hnsw",           # Hierarchical Navigable Small World
    metric="cosine",       # similarity metric (like ASC/DESC but for meaning)
    ef_construction=200,   # build quality (like PCT_FREE for B-Trees)
    M=16                   # connections per node (like branching factor)
)

# Other index types:
# IVF  — Clusters vectors first, searches nearby clusters (like partitioning)
# FLAT — Brute force, checks ALL vectors (like a full table scan)

How HNSW works: Imagine a multi-layer graph. Top layer = few nodes (coarse navigation). Bottom layer = all nodes (fine detail). Search starts at top → hops to closest node → drops a layer → repeats → arrives near the target.
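For contrast, here's what the FLAT index (the full-table-scan of the vector world) actually does — a brute-force scan over a toy in-memory store. HNSW exists precisely because this is O(n) per query: fine for thousands of chunks, hopeless for millions. (Toy 2-d vectors and chunk IDs, for illustration only.)

```python
# FLAT search in ~10 lines: score EVERY vector, sort, take top_k.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def flat_search(query, store, top_k=2):
    """store: {chunk_id: vector}. Returns top_k (id, score), best first —
    the vector equivalent of ORDER BY score DESC FETCH FIRST k ROWS ONLY."""
    scored = [(cid, cosine(query, vec)) for cid, vec in store.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = {
    "chunk_001": [0.9, 0.1],
    "chunk_002": [0.1, 0.9],
    "chunk_003": [0.7, 0.3],
}
flat_search([1.0, 0.0], store, top_k=2)
# → chunk_001 and chunk_003 rank highest, chunk_002 is filtered out
```

HNSW gets near-identical results while visiting only a tiny fraction of the vectors — the 95-99% recall tradeoff in the table below.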

The SQL developer's translation:

| SQL Index Concept | RAG Vector Index Equivalent |
|---|---|
| B-Tree (sorted values) | HNSW (semantic proximity graph) |
| Bitmap (low-cardinality) | IVF (clustered vectors) |
| Full table scan | FLAT index (brute-force vector scan) |
| INDEX RANGE SCAN | Approximate Nearest Neighbor (ANN) search |
| 100% exact match | 95-99% approximate match (tradeoff for speed) |
| `ANALYZE TABLE` for CBO stats | Index tuning: `ef_construction`, `M` params |

Key difference: Your B-Tree finds WHERE category = 'Small Cap' — an exact match. A vector index finds WHERE meaning ≈ 'risks of small-cap investing' — a semantic match. "Risk" matches "volatility", "danger", "exposure" — even though the words are different.


Phase 4: Query Execution

A user asks a question. Let's trace the execution path.

🟡 SQL — You know this

-- User wants: "Show me small-cap funds with low liquidity"
-- Translated to SQL:
SELECT fund_name, liquidity_ratio, avg_volume
FROM   funds f
JOIN   fund_liquidity fl ON f.fund_id = fl.fund_id
WHERE  f.category = 'Small Cap'
  AND  fl.liquidity_ratio < 0.5
ORDER BY fl.liquidity_ratio ASC;

-- Oracle CBO generates execution plan:
----------------------------------------------
-- | Id | Operation               | Cost |
-- |----|-------------------------|------|
-- |  0 | SELECT STATEMENT        |   15 |
-- |  1 |  SORT ORDER BY          |   15 |
-- |  2 |   NESTED LOOPS          |   12 |
-- |  3 |    INDEX RANGE SCAN     |    3 |
-- |    |     idx_funds_category  |      |
-- |  4 |    TABLE ACCESS ROWID   |    1 |
-- |    |     funds               |      |
-- |  5 |    INDEX RANGE SCAN     |    2 |
-- |    |     idx_liq_fund_ratio  |      |
-- |  6 |    TABLE ACCESS ROWID   |    1 |
-- |    |     fund_liquidity      |      |
----------------------------------------------

Flow: Parse SQL → CBO picks cheapest plan → index scan → get ROWIDs → fetch rows → sort → return ResultSet. Deterministic. Same query, same result, every time.

🔵 RAG — Now here's the equivalent

# User asks (in plain English, no SQL needed):
# "What are the liquidity risks of small-cap funds?"

# Step 1: Embed the question — like the CBO "parsing" your query
# But instead of parsing SQL syntax, it converts MEANING into a vector
query_vec = embed("What are the liquidity risks of small-cap funds?")
# → [0.034, -0.221, 0.187, ...] (1536-d)

# Step 2: Search the vector index — like INDEX RANGE SCAN
# But instead of matching column values, it matches semantic similarity
results = vector_db.query(
    vector=query_vec,
    top_k=5,                           # like FETCH FIRST 5 ROWS ONLY
    filter={"source": "regulatory"},   # like adding a WHERE clause
    include_metadata=True
)

# Step 3: Results come back ranked by similarity score
# (like ORDER BY relevance DESC)
# chunk_042  score: 0.93  "liquidity risk in small-cap..."
# chunk_187  score: 0.89  "market depth concerns for..."
# chunk_015  score: 0.85  "redemption pressure during..."
# chunk_221  score: 0.81  "SEBI stress testing norms..."
# chunk_098  score: 0.78  "exit load and lock-in..."

Notice: the query doesn't contain the phrases "market depth" or "redemption pressure" — but those chunks scored high because their meaning relates to the question. A B-Tree could never do this. It only matches exact strings.

The SQL developer's translation:

| SQL Query Concept | RAG Query Equivalent |
|---|---|
| `SELECT ... FROM ... WHERE` | `vector_db.query(vector, top_k, filter)` |
| CBO parses SQL → execution plan | Embedding model converts question → vector |
| INDEX RANGE SCAN on B-Tree | ANN search on HNSW graph |
| `WHERE category = 'Small Cap'` | Cosine similarity ≈ "small-cap risks" |
| `ORDER BY column ASC` | Results ranked by similarity score (0.0 → 1.0) |
| `FETCH FIRST 5 ROWS ONLY` | `top_k=5` |
| `AND source = 'regulatory'` | `filter={"source": "regulatory"}` |

Phase 5: Retrieval — What Comes Back

🟡 SQL — You know this

You get back a typed ResultSet — exact rows, exact columns, exact values:

+------------------+-------+----------+
| FUND_NAME        | LIQ_R | AVG_VOL  |
+------------------+-------+----------+
| Quant Small Cap  | 0.31  | 1,24,500 |
| Nippon Small Cap | 0.38  | 2,87,300 |
| Axis Small Cap   | 0.42  | 3,45,100 |
| HDFC Small Cap   | 0.47  | 4,12,800 |
+------------------+-------+----------+
4 rows returned in 3ms

Exact. Typed. Deterministic. Run it again → same 4 rows.

🔵 RAG — Now here's the equivalent

Instead of typed rows, you get back ranked text passages with relevance scores:

[
  {
    "id": "chunk_042",
    "score": 0.93,
    "text": "Small-cap funds face significant liquidity risk
      due to lower market depth. During market stress,
      bid-ask spreads can widen 3-5x compared to
      large-cap securities...",
    "metadata": {
      "source": "SEBI_circular_2024.pdf",
      "page": 7,
      "date": "2024-03-15"
    }
  },
  {
    "id": "chunk_187",
    "score": 0.89,
    "text": "Redemption pressure in mid and small-cap
      categories reached ₹8,200 Cr in Q4, forcing
      fund managers to sell positions at unfavorable
      prices...",
    "metadata": { "..." }
  }
]

The SQL developer's translation:

| SQL ResultSet | RAG Retrieval |
|---|---|
| Rows with typed columns | Text chunks with similarity scores |
| `rs.getString("FUND_NAME")` | `result["text"]` (free-form prose) |
| `rs.getDouble("LIQ_RATIO")` | `result["score"]` (0.0 to 1.0 relevance) |
| Always returns exact matches | May return "near misses" (related, not exact) |
| Deterministic — same query, same result | Probabilistic — may vary slightly |
| You render it in a table/chart | You feed it to an LLM for synthesis |

Phase 6: Response — The Final Output

🟡 SQL — You know this

Your application takes the ResultSet and renders it:

ResultSet rs = stmt.executeQuery(sql);

while (rs.next()) {
    String name = rs.getString("FUND_NAME");
    double liq  = rs.getDouble("LIQ_RATIO");
    long   vol  = rs.getLong("AVG_VOLUME");
    // render in UI table, chart, or report
}

Output: A data table, a chart, a CSV export. The numbers, not the narrative. The human analyst interprets what the data means.

🔵 RAG — Now here's the extra step SQL doesn't have

RAG has one more step: an LLM reads the retrieved chunks and generates a human-readable answer. Think of it as an AI analyst who reads the data and writes the report for you.

# The retrieved chunks become "context" for the LLM
# Think of it like: the LLM is a very smart stored procedure
# that reads text input and generates text output

prompt = f"""Based on the following documents,
answer the user's question.

Context:
{chunk_042.text}
{chunk_187.text}
{chunk_015.text}

Question: What are the liquidity risks?

Answer:"""

response = llm.generate(prompt)

Output:

Small-cap funds face significant liquidity risks across three dimensions:

1. Market depth: Lower trading volumes mean bid-ask spreads can widen 3-5x during stress, per SEBI's 2024 circular.

2. Redemption pressure: Q4 saw ₹8,200 Cr in outflows from mid/small-cap categories, forcing managers to liquidate at unfavorable prices.

3. Exit constraints: Funds may impose exit loads or gate redemptions during severe stress, limiting investor flexibility.

Sources: SEBI Circular (March 2024, p.7), Fund Industry Report Q4

The SQL developer's translation:

| SQL Response | RAG Response |
|---|---|
| ResultSet → app renders table/chart | Retrieved chunks → LLM generates prose |
| Human interprets the numbers | LLM interprets AND explains |
| Shows "what" (the data) | Shows "what" + "why" + "so what" |
| `rs.next()` loop in Java | `llm.generate(prompt)` in Python |
| Output: rows & columns | Output: paragraphs & citations |

The big "aha" moment: RAG is like SQL where the ResultSet gets passed to an incredibly smart stored procedure that reads the data, understands it, and writes a report — automatically.
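The handoff between retrieval and generation is nothing more than string assembly — the retrieved chunks become the "input parameters" of that stored procedure. A runnable sketch of the prompt-building step (the LLM call itself is out of scope here; in practice you'd send `prompt` to OpenAI, Anthropic, a local model, etc.):

```python
# Build the context-stuffed prompt from retrieved chunk texts.
# The LLM never sees your documents — only what lands in this string.

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)
    return (
        "Based on the following documents, answer the user's question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

retrieved = [
    "Small-cap funds face significant liquidity risk due to lower market depth...",
    "Redemption pressure in mid and small-cap categories reached Rs 8,200 Cr in Q4...",
]
prompt = build_prompt("What are the liquidity risks?", retrieved)
# prompt now contains both chunks followed by the question,
# ending with "Answer:" so the model continues from there.
```

This is also where retrieval quality matters most: if the wrong chunks come back in Phase 5, no amount of LLM cleverness in Phase 6 fixes the answer.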


The Complete Pipeline — One Table to Rule Them All

| Phase | SQL (You Know This) | RAG (Now You Know This Too) |
|---|---|---|
| 📦 Data | Structured tables, typed columns | Unstructured docs, PDFs, emails |
| 📥 Store | `CREATE TABLE` → `INSERT` rows → data blocks | Chunk → `embed()` → store vectors + metadata |
| 🗂️ Index | B-Tree / Bitmap (exact value lookup) | HNSW / IVF (semantic proximity graph) |
| 🔍 Query | Parse SQL → CBO plan → INDEX RANGE SCAN | Embed question → cosine similarity → top-K ANN |
| 📤 Retrieve | Exact typed rows via ROWID (deterministic) | Ranked text chunks via score (probabilistic) |
| 💬 Respond | App renders table/chart (human interprets) | LLM synthesizes answer (AI interprets) |

When to Use Which

| Question Type | Use | Why |
|---|---|---|
| "Show me funds with Sharpe > 1.0" | SQL | Exact numeric filtering on structured data |
| "What's the NAV of Axis Small Cap today?" | SQL | Single exact value lookup |
| "Top 10 holdings by weight in Fund X" | SQL | Ordered result set from structured tables |
| "Why did our fund underperform?" | RAG | Needs synthesis across reports and commentary |
| "Explain SEBI's new stress testing norms" | RAG | Requires reading and summarizing regulatory PDFs |
| "What risks should I tell my client about?" | RAG | Open-ended, needs contextual intelligence |

The Real Power: Combine Both

The most powerful systems use both pipelines together:

  1. Receive a natural language question
  2. Route it — SQL for data questions, RAG for document questions
  3. Execute the appropriate pipeline
  4. Combine structured data + document context into a single response

Example: "Why did our mid-cap fund underperform last quarter?" triggers SQL for the performance numbers AND RAG for the attribution reports and CIO commentary — giving the user both the data and the explanation in one answer.
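The routing step can be sketched in a few lines. The keyword heuristic below is a deliberate oversimplification — production routers typically use an LLM classifier or function-calling to pick the pipeline — and the keyword list is invented for this toy:

```python
# Toy router: dispatch data-shaped questions to SQL, everything
# else to RAG. A real router would classify with an LLM instead
# of keyword matching.

DATA_KEYWORDS = {"nav", "sharpe", "holdings", "top", "weight", "return"}

def route(question: str) -> str:
    words = set(question.lower().replace("?", "").split())
    return "sql" if words & DATA_KEYWORDS else "rag"

route("What's the NAV of Axis Small Cap today?")  # → "sql"
route("Why did our fund underperform?")           # → "rag"
```

For hybrid questions like the mid-cap example above, a router would return both labels and the application would run both pipelines, then merge the numbers with the narrative.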


TL;DR

If you know SQL, you already know the shape of RAG:

SQL:  Table → INSERT row   → B-Tree index  → WHERE exact match → ResultSet    → App renders table
RAG:  Doc   → Embed chunk  → HNSW index    → WHERE meaning ≈   → Text chunks  → LLM writes answer

Both are retrieval systems. SQL retrieves data. RAG retrieves meaning. The best architectures use both.


If this helped RAG "click" for you, drop a ❤️ — and if you're building systems that combine both, I'd love to hear about your architecture in the comments.
