<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suraj Sharma</title>
    <description>The latest articles on DEV Community by Suraj Sharma (@surajsharmaind).</description>
    <link>https://dev.to/surajsharmaind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1778532%2Fe7537ddb-4e56-4183-9e93-b17e843d1093.png</url>
      <title>DEV Community: Suraj Sharma</title>
      <link>https://dev.to/surajsharmaind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/surajsharmaind"/>
    <language>en</language>
    <item>
      <title>Test Post Title</title>
      <dc:creator>Suraj Sharma</dc:creator>
      <pubDate>Mon, 25 May 2026 12:33:41 +0000</pubDate>
      <link>https://dev.to/surajsharmaind/test-post-title-80a</link>
      <guid>https://dev.to/surajsharmaind/test-post-title-80a</guid>
      <description>&lt;h2&gt;
  
  
  Hello
&lt;/h2&gt;

&lt;p&gt;This is a test post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>RAG Explained: How Retrieval-Augmented Generation Actually Works</title>
      <dc:creator>Suraj Sharma</dc:creator>
      <pubDate>Mon, 25 May 2026 11:56:02 +0000</pubDate>
      <link>https://dev.to/surajsharmaind/rag-explained-how-retrieval-augmented-generation-actually-works-dd7</link>
      <guid>https://dev.to/surajsharmaind/rag-explained-how-retrieval-augmented-generation-actually-works-dd7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1huwl40mxv99gjfyy340.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1huwl40mxv99gjfyy340.png" alt="RAG Pipeline Diagram" width="463" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Phases of RAG
&lt;/h2&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) splits into &lt;strong&gt;two separate pipelines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion pipeline&lt;/strong&gt; — runs once (or on a schedule) to process your documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query pipeline&lt;/strong&gt; — runs live for every user request&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Not Just Send All Your Text to the LLM?
&lt;/h2&gt;

&lt;p&gt;Three hard problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: sending millions of tokens with every query gets expensive fast, since hosted LLM APIs typically bill per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context limits&lt;/strong&gt;: even a 128K-token window can't hold a knowledge base that runs to millions of tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: answers tend to get worse when the relevant passage is buried in irrelevant text&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG retrieves only the few most relevant chunks (typically &lt;strong&gt;3–5 chunks&lt;/strong&gt;) for each question and sends just those to the LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Store Vectors Instead of Just Doing Text Search?
&lt;/h2&gt;

&lt;p&gt;Keyword search only matches the words that actually appear in the query. &lt;strong&gt;Vectors capture meaning.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These three phrases share almost no words, yet a good embedding model maps them to nearby vectors:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Refunds take 5 days"&lt;br&gt;
"money-back in a week"&lt;br&gt;
"reimbursement timeline: 5 business days"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They cluster close together in embedding space, which is exactly what we want.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ingestion Pipeline (Step by Step)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3koyg06h2ubjlfvfud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3koyg06h2ubjlfvfud.png" alt="RAG Chunking Diagram" width="435" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why chunk?&lt;/strong&gt; An LLM has a fixed context window (e.g. 128K tokens). Your knowledge base could be millions of tokens. You can't send it all. Chunking lets you retrieve only the 3–5 most relevant pieces and send those — keeping the prompt small and focused. Overlap prevents losing context at chunk boundaries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Chunking&lt;/strong&gt;&lt;br&gt;
Split documents into ~500-token pieces with overlap so no idea gets cut off at a boundary.&lt;/p&gt;
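
&lt;p&gt;A minimal sketch of overlapping chunking, assuming whitespace-separated words as a rough stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer). The function name &lt;code&gt;chunk_words&lt;/code&gt; and its defaults are illustrative, not taken from any library:&lt;/p&gt;

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence that
    straddles a boundary appears whole in at least one chunk.
    """
    # range(chunk_size) holds exactly the valid overlaps: 0 .. chunk_size - 1
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    if not words:
        return []

    step = chunk_size - overlap
    # Stop once a chunk has reached the end of the text, so we never
    # emit a trailing chunk that only repeats the overlap.
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, last_start, step)
    ]


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# prints ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

&lt;p&gt;Each chunk starts with the last word of the previous one. That shared word is the overlap; in practice it would be a few dozen tokens rather than one.&lt;/p&gt;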

&lt;p&gt;&lt;strong&gt;Step 2 — Embedding&lt;/strong&gt;&lt;br&gt;
The embedding model (e.g. &lt;code&gt;text-embedding-3-small&lt;/code&gt;) converts each chunk into a vector of 1536 numbers (that model's default output size).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Storage&lt;/strong&gt;&lt;br&gt;
Both the vector and the original text are stored in the vector DB together — you need the text back when it's retrieved later.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Query Pipeline (Step by Step)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Embed the question&lt;/strong&gt;&lt;br&gt;
When a user asks a question, it goes through the &lt;strong&gt;exact same embedding model&lt;/strong&gt; (critical — different models produce incompatible vector spaces).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Similarity search&lt;/strong&gt;&lt;br&gt;
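The resulting query vector is compared against the stored chunk vectors using &lt;strong&gt;cosine similarity&lt;/strong&gt;, which measures the angle between two vectors and ignores their length: vectors pointing in the same direction score close to 1, while unrelated ones score near 0.&lt;/p&gt;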
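
&lt;p&gt;For reference, cosine similarity is just the dot product divided by the product of the two vector lengths. A plain-Python sketch, assuming vectors are lists of floats (real systems would use NumPy or the vector DB's built-in operator):&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 same direction, 0.0 orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

&lt;p&gt;The first pair scores 1.0 even though one vector is twice as long, which is the point: only direction matters, not magnitude.&lt;/p&gt;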

&lt;p&gt;&lt;strong&gt;Step 3 — Retrieve and inject&lt;/strong&gt;&lt;br&gt;
The top-K most similar chunks are pulled out with their original text and packed into the LLM's prompt as context.&lt;/p&gt;
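
&lt;p&gt;The retrieve-and-inject step can be sketched as below. Everything here is a toy assumption: the store is an in-memory list of (vector, text) pairs, the 3-dimensional vectors are made up by hand instead of coming from an embedding model, and the prompt wording and the helper names &lt;code&gt;top_k&lt;/code&gt; and &lt;code&gt;build_prompt&lt;/code&gt; are illustrative:&lt;/p&gt;

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, store, k=3):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = [(cosine(query_vec, vec), text) for vec, text in store]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def build_prompt(question, hits):
    """Pack the retrieved chunk texts into the prompt as context."""
    context = "\n\n".join(text for _score, text in hits)
    return (
        "Answer using only the context below.\n\n"
        "Context:\n" + context + "\n\n"
        "Question: " + question
    )


# Toy store: each entry keeps the vector AND the original text together.
store = [
    ([0.9, 0.1, 0.0], "Refunds take 5 business days."),
    ([0.0, 0.2, 0.9], "Our office is closed on public holidays."),
    ([0.8, 0.3, 0.1], "Refunds go back to the original payment method."),
]

query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded question
hits = top_k(query_vec, store, k=2)
prompt = build_prompt("How long do refunds take?", hits)
print(prompt)
```

&lt;p&gt;With these vectors, the two refund chunks are retrieved and the unrelated holiday chunk never reaches the prompt. This brute-force scan is fine for a few thousand chunks; larger collections need an index.&lt;/p&gt;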




&lt;h2&gt;
  
  
  Why a Vector DB Specifically?
&lt;/h2&gt;

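&lt;p&gt;Finding the 5 nearest vectors out of &lt;strong&gt;10 million rows&lt;/strong&gt; has to be fast enough for a live request, ideally well under 100ms.&lt;/p&gt;

&lt;p&gt;Approximate nearest-neighbor indexes like &lt;strong&gt;HNSW (Hierarchical Navigable Small World)&lt;/strong&gt; do this efficiently by walking a graph of neighboring vectors instead of scanning everything, trading a small amount of recall for speed. Without such an index, a database has to compare the query against every single row one by one, which is impractical at scale. That is also why plain Postgres needs an extension like pgvector for this job.&lt;/p&gt;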

&lt;p&gt;Popular tools built for this exact problem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Managed cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Open source / cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Lightweight / local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;pgvector&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Postgres extension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;RAG is the practical answer to the question: &lt;em&gt;"How do I give an LLM access to my knowledge base without it being slow, expensive, or hallucinating?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The key insight is that &lt;strong&gt;retrieval and generation are separate concerns&lt;/strong&gt; — get retrieval right first, and the generation almost takes care of itself.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
