<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Md Arsalan Arshad</title>
    <description>The latest articles on DEV Community by Md Arsalan Arshad (@arsalan_ai).</description>
    <link>https://dev.to/arsalan_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3835276%2F8feb42f9-3d6e-4d3f-bd64-c2997b30ce95.png</url>
      <title>DEV Community: Md Arsalan Arshad</title>
      <link>https://dev.to/arsalan_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arsalan_ai"/>
    <language>en</language>
    <item>
      <title>Why I Separated the Indexing and Query Pipelines — And What Happened the One Time I Didn't</title>
      <dc:creator>Md Arsalan Arshad</dc:creator>
      <pubDate>Thu, 02 Apr 2026 10:23:11 +0000</pubDate>
      <link>https://dev.to/arsalan_ai/why-i-separated-the-indexing-and-query-pipelines-and-what-happened-the-one-time-i-didnt-4en</link>
      <guid>https://dev.to/arsalan_ai/why-i-separated-the-indexing-and-query-pipelines-and-what-happened-the-one-time-i-didnt-4en</guid>
      <description>&lt;p&gt;I was testing LocusLab, a multi-tenant AI agent platform I am building, and something was off with the latency. Not consistently off. Sometimes the agent would reply in under 2 seconds, sometimes it would take 9 or 10. No errors in the logs. No timeouts. Just this random unpredictable delay that made no sense.&lt;/p&gt;

&lt;p&gt;My first instinct was the queue. Messages piling up, maybe? I checked; queue depth was fine. Then I thought it was the webhook, that the DM events were arriving late. Also fine. Then I suspected the LLM calls were inconsistent, so I started logging every stage of the pipeline individually. Still nothing.&lt;/p&gt;

&lt;p&gt;It took me 3 days to find the actual cause.&lt;/p&gt;

&lt;p&gt;Both my indexing pipeline and query pipeline were running inside the same Lambda function. So when a user uploaded documents and indexing started, all the heavy work, preprocessing the documents, splitting them into chunks, generating embeddings, storing them in the vector database, was consuming the same compute and memory the query side needed to reply to messages. The indexing would finish, resources would free up, and query latency would drop back to normal. Then someone would upload another document, indexing would kick off again, and latency would spike. That is why it looked random. It was not random at all. It was tied exactly to whenever indexing was happening in the background.&lt;/p&gt;

&lt;p&gt;The moment I realized this I felt stupid, because I knew this was the right architecture from the start and I still did not do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why These Two Things Cannot Live Together
&lt;/h2&gt;

&lt;p&gt;Before getting into what I changed, it helps to understand what these two pipelines actually are and why they are so different.&lt;/p&gt;

&lt;p&gt;Think of it this way. The indexing pipeline is the librarian organising books in the background. Nobody is standing there waiting for each book to be placed on the shelf. It can take its time. The more it batches together, the more efficient it becomes. A document taking 2 minutes to fully index is completely fine because no user is waiting on the other side.&lt;/p&gt;

&lt;p&gt;The query pipeline is the librarian answering a question from someone standing at the desk. That person is waiting right now. Every second feels long. You need to find the answer as fast as possible and get back to them.&lt;/p&gt;

&lt;p&gt;When you put both of these in the same function, the librarian is trying to organise shelves and answer questions at the same time. The person at the desk keeps waiting because the librarian is busy in the back.&lt;/p&gt;

&lt;p&gt;More technically, the indexing pipeline is optimized for throughput. You want to process as many documents as possible, and batching embedding calls makes them cheaper. The query pipeline is optimized for latency. You want to get below 2 seconds end to end, run searches in parallel, and check the cache first so you can skip the whole pipeline on repeated questions. These two goals fight each other when they share the same compute.&lt;/p&gt;

&lt;p&gt;And the frustrating part is the failure is invisible. No errors, no crashes, just inconsistent latency that looks like 10 different problems before you find the real one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11zhq9r0vwe8f2tv0pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11zhq9r0vwe8f2tv0pm.png" alt=" " width="800" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Changed
&lt;/h2&gt;

&lt;p&gt;I split them into two separate Lambda functions with an SQS queue connecting them.&lt;/p&gt;

&lt;p&gt;The indexing pipeline is now its own Lambda, triggered by the SQS queue. When someone uploads documents we push a job to the queue and immediately tell the user their upload was received. The Lambda picks it up in the background and handles everything: figuring out what type of document it is, extracting the text, splitting it into chunks that make sense for retrieval, generating embeddings, and storing everything in VectorDB. The user is not waiting for any of this. If it is slow it does not matter. If it fails, the message stays in the queue and retries automatically, and jobs that keep failing go to a dead letter queue so nothing silently disappears.&lt;/p&gt;
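
&lt;p&gt;That flow can be sketched in a few lines of Python. Everything here is illustrative: make_index_job, indexing_handler, and INDEX_QUEUE_URL are hypothetical names, not LocusLab's actual code, and the SQS push (via boto3) is shown only as a comment:&lt;/p&gt;

```python
import json

def make_index_job(tenant_id, document_key):
    """Build the SQS message body for one uploaded document."""
    return json.dumps({"tenant_id": tenant_id, "document_key": document_key})

# The upload endpoint would push the job and return to the user immediately:
#   boto3.client("sqs").send_message(QueueUrl=INDEX_QUEUE_URL,
#                                    MessageBody=make_index_job(tid, key))

def indexing_handler(event, context=None):
    """Entry point of the indexing Lambda, triggered by the SQS queue.
    Raising an exception here leaves the message on the queue for retry."""
    indexed = []
    for record in event["Records"]:
        job = json.loads(record["body"])
        # detect type, extract text, chunk, embed, upsert into the vector index
        indexed.append(job["document_key"])
    return {"indexed": indexed}
```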

&lt;p&gt;The query pipeline is a separate Lambda. When a message comes in it handles the full retrieval flow. It checks the cache first because a cache hit means you can respond in under 50ms without running any search at all. If it is a cache miss it runs vector search and keyword search at the same time in parallel rather than one after the other, then combines the results, picks the most relevant chunks, builds the context, calls the LLM, and returns the response. This function has no idea the indexing Lambda exists.&lt;/p&gt;
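
&lt;p&gt;A minimal sketch of that query path, with the cache check first and the two searches running in parallel. The searches and the LLM call are injected as plain callables here; the names are placeholders, not the real clients:&lt;/p&gt;

```python
import concurrent.futures

cache = {}

def handle_query(query, vector_search, keyword_search, generate):
    """Cache first; on a miss run both searches in parallel, then call the LLM."""
    if query in cache:
        return cache[query]                 # cache hit: no search, no LLM call
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        vec = pool.submit(vector_search, query)   # both searches start
        kw = pool.submit(keyword_search, query)   # at the same time
        chunks = vec.result() + kw.result()       # merge, then rank/trim
    reply = generate(query, chunks)
    cache[query] = reply
    return reply
```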

&lt;p&gt;The two functions share only two things: the VectorDB index where vectors are stored and a DynamoDB table that tracks document and chunk metadata. That is the only connection between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Got Better
&lt;/h2&gt;

&lt;p&gt;Query latency dropped and stayed consistent. I ran the same test that originally broke things, a large document upload triggering full indexing, while at the same time hitting the query side with multiple messages. Latency did not move.&lt;/p&gt;

&lt;p&gt;Debugging also became much simpler, which I did not expect.&lt;/p&gt;

&lt;p&gt;Before the split every investigation started with figuring out what else was happening in the function at that exact moment. After the split that question became irrelevant. If a query is slow the problem is in the query Lambda. If a document fails to index the problem is in the indexing Lambda. They fail separately and it is obvious where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Part
&lt;/h2&gt;

&lt;p&gt;I knew this was the right architecture before I started building. Separate the pipelines, queue between them, indexing running async in the background. I have read enough about system design to know this.&lt;br&gt;
But I told myself: just for now, just to move fast and see the agent working end to end, I will keep them together and fix it later. That was the plan.&lt;/p&gt;

&lt;p&gt;Then I spent 3 days debugging a problem that should not have existed.&lt;/p&gt;

&lt;p&gt;The agent quality was actually good during all of this. The retrieval was working, the responses were accurate, the Shopify integration was pulling the right products. All of that was fine. The only thing hurting the user experience was an architecture shortcut I took on day one that I knew was wrong.&lt;/p&gt;

&lt;p&gt;That is the part that still bothers me. It was not a hard problem. It was a known problem I chose to defer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I Am Now and What Comes Next
&lt;/h2&gt;

&lt;p&gt;I am still on Lambda for both pipelines. At my current scale it works fine and Lambda is honestly a good fit for early stage products. It scales automatically, you only pay for what you use, and there is no infrastructure to manage.&lt;/p&gt;

&lt;p&gt;The two real limitations I am aware of are cold starts and the 15-minute execution limit. Cold starts add latency when a function has not been called recently, which matters a lot for the query side. The 15-minute limit means very large document processing jobs need to be broken into smaller pieces so they do not hit the ceiling.&lt;/p&gt;
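
&lt;p&gt;Breaking a job up can be as simple as fanning the chunk list out into separate queue messages, so no single invocation comes near the ceiling. This is an illustrative sketch, not LocusLab's actual code, and the batch size is arbitrary:&lt;/p&gt;

```python
def fan_out(chunk_ids, batch_size=200):
    """Split one large indexing job into many small SQS messages so no
    single Lambda invocation risks the 15-minute execution limit."""
    return [chunk_ids[i:i + batch_size]
            for i in range(0, len(chunk_ids), batch_size)]

batches = fan_out(list(range(450)))   # 450 chunks become 3 queue messages
```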

&lt;p&gt;When traffic grows to the point where I need a constantly warm query function and more control over how long indexing jobs can run, I will move to ECS. But that is a future problem. The separation itself is what mattered, not which compute service I used to do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Keeping Them Together Is Actually Fine
&lt;/h2&gt;

&lt;p&gt;If you are building a prototype, single tenant, small number of documents, no real users yet, keep them together. The overhead of managing two functions, a queue between them, and separate monitoring is not worth it when you are just trying to see if the product idea works at all.&lt;/p&gt;

&lt;p&gt;The moment you have users uploading documents while other users are querying at the same time, separate them. That is the line. Not because of some rule, but because that is exactly when one pipeline starts silently hurting the other one.&lt;/p&gt;

&lt;p&gt;Do not wait for 3 days of confused debugging to make the call.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>RAG Components Explained: The Building Blocks of Modern AI</title>
      <dc:creator>Md Arsalan Arshad</dc:creator>
      <pubDate>Wed, 25 Mar 2026 13:46:08 +0000</pubDate>
      <link>https://dev.to/arsalan_ai/rag-components-explained-the-building-blocks-of-modern-ai-5176</link>
      <guid>https://dev.to/arsalan_ai/rag-components-explained-the-building-blocks-of-modern-ai-5176</guid>
      <description>&lt;p&gt;Retrieval-Augmented Generation (RAG) is one of the most powerful techniques for making Large Language Models (LLMs) smarter, more factual, and more up-to-date. Instead of relying only on what an LLM was trained on, RAG retrieves relevant external information first and then asks the model to generate an answer based on that information.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll break down the core components of RAG, step by step, in a simple and practical way. By the end, you’ll have a clear mental model of how RAG works and why each component matters.&lt;/p&gt;

&lt;p&gt;RAG is not a single model — it’s a pipeline of steps. Here are the detailed building blocks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpeivdxvi8xt0inifgvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpeivdxvi8xt0inifgvj.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Loader
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Document Loader?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Document Loader is a component that reads data from files or sources and converts it into a format your AI model can understand and process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Are Document Loaders Important?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most AI models, especially LLMs, only understand text, not raw PDFs, Excel files, images, or websites.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document loaders standardize and normalize the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They make your data searchable, chunkable, and usable for embeddings or question-answering pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Imagine asking someone to summarize a book. If the book is in a messy stack of scanned pages, it’s impossible. But if all pages are typed out and cleaned, it’s easy. Document loaders do this “cleaning and typing out” automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Document Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document loaders can be classified by source type:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Local File Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples: TextLoader, PDFLoader, CSVLoader, DocxLoader&lt;br&gt;
Reads files from your computer or server.&lt;br&gt;
Handles different file formats and converts them to plain text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Web Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples: WebBaseLoader, SitemapLoader&lt;br&gt;
Fetch data from web pages, APIs, or RSS feeds.&lt;br&gt;
Often includes cleaning HTML tags, scripts, or ads before processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Cloud Storage Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples: GoogleDriveLoader, S3Loader, NotionLoader&lt;br&gt;
Load documents from cloud services (Google Drive, AWS S3, Notion, Confluence).&lt;br&gt;
Often requires authentication keys or APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Database Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples: SQLDatabaseLoader, MongoDBLoader&lt;br&gt;
Load data from structured databases (tables, queries).&lt;br&gt;
Converts database rows into textual documents for NLP models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of Document Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parsing — Understand the structure of the source (PDF pages, CSV rows).&lt;/li&gt;
&lt;li&gt;Cleaning — Remove noise like HTML tags, whitespace, or irrelevant metadata.&lt;/li&gt;
&lt;li&gt;Splitting/Chunking — Large documents are broken into smaller, model-friendly chunks.&lt;/li&gt;
&lt;li&gt;Metadata Extraction — Keep info like file name, page number, or URL for traceability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Subtopics / Advanced Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Recursive Document Loading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Breaks nested structures (PDF with multiple sections, Word with headings).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Useful when you need fine-grained context.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like peeling an onion layer by layer to get every bit of flavor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Combining Multiple Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LangChain allows combining loaders to handle multiple formats in one pipeline.
Example: Load PDFs + CSVs + Webpages together for a knowledge base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Blending apples, bananas, and strawberries together for a multi-fruit smoothie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Custom Loaders&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can write a custom loader if your data source is unique (e.g., a proprietary ERP system).
Key method: load(), which returns a list of Document objects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Designing your own custom blender attachment for a special fruit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document Object Structure (LangChain)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every loader returns a Document object with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;page_content → The text content.&lt;/li&gt;
&lt;li&gt;metadata → Dictionary with extra info (file name, source, page number).&lt;/li&gt;
&lt;/ul&gt;
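
&lt;p&gt;That structure can be sketched as a small dataclass, a simplified stand-in for LangChain's actual Document class:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Simplified version of the Document object every loader returns."""
    page_content: str                             # the extracted text
    metadata: dict = field(default_factory=dict)  # source, page number, etc.

doc = Document(page_content="Paris is the capital of France.",
               metadata={"source": "geography.pdf", "page": 12})
```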

&lt;p&gt;In one sentence:&lt;/p&gt;

&lt;p&gt;“Document loaders normalize and convert raw sources into Document objects with text and metadata, enabling LLMs to process heterogeneous sources efficiently.”&lt;/p&gt;
&lt;h2&gt;
  
  
  Text Splitter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Text Splitter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Text Splitter is a tool that breaks a large document into smaller, manageable pieces (chunks) so that a language model can process them efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Imagine you have a giant pizza. Your mouth (LLM) can’t eat the whole pizza at once, so you cut it into slices. Each slice = one chunk of text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Do We Need Text Splitters?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMs have a context window limit (e.g., GPT-4 Turbo handles ~128k tokens). If your document is too big, it won’t fit.&lt;/li&gt;
&lt;li&gt;Splitting ensures:
&lt;ul&gt;
&lt;li&gt;The model doesn’t forget earlier parts.&lt;/li&gt;
&lt;li&gt;You can store chunks in a vector database for retrieval.&lt;/li&gt;
&lt;li&gt;Queries can be answered with relevant small pieces, not the entire document.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Think of studying a textbook. Instead of memorizing the whole 500 pages at once, you break it into chapters and sections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Text Splitters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are multiple strategies for splitting text. Let’s go through them one by one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Character/Text Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splits text by character count.&lt;/li&gt;
&lt;li&gt;Example: Split every 1000 characters.&lt;/li&gt;
&lt;li&gt;Simple but may cut sentences awkwardly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like chopping a book into pieces of 10 pages each, without caring if a chapter gets cut mid-way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Recursive Character/Text Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smarter version → tries to split by logical boundaries first (paragraphs → sentences → words → characters).&lt;/li&gt;
&lt;li&gt;Ensures chunks are readable and meaningful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like first cutting a cake by layers, then into slices, and finally into bites, if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Token Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splits based on tokens (subword units) instead of characters.&lt;/li&gt;
&lt;li&gt;Useful because LLMs process tokens, not raw characters.&lt;/li&gt;
&lt;li&gt;Prevents exceeding model’s token limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like breaking a speech into words, not random letters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;d. Semantic/Embedding-Based Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses embeddings to split text into semantically coherent chunks.&lt;/li&gt;
&lt;li&gt;Ensures each chunk makes sense contextually, not just by size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like splitting a documentary into topics rather than every 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;e. Markdown / Code-Aware Splitter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specialized splitters for Markdown docs, programming code, or XML.&lt;/li&gt;
&lt;li&gt;Preserves structure (e.g., headers, code blocks).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like cutting a programming tutorial while keeping each function intact instead of cutting mid-function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Parameters of Text Splitters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Chunk Size&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· Maximum length of a chunk (e.g., 500 tokens).&lt;/p&gt;

&lt;p&gt;· Too small = loses context. Too large = may exceed model limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Chunk Overlap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· Extra tokens carried over between chunks.&lt;/p&gt;

&lt;p&gt;· Prevents context loss at boundaries.&lt;/p&gt;

&lt;p&gt;· Example: If chunk size = 500 and overlap = 50 → chunk 1 = 1–500, chunk 2 = 451–950.&lt;/p&gt;
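
&lt;p&gt;The arithmetic above can be checked with a minimal character splitter. This is an illustrative sketch, not LangChain's implementation: each new chunk starts chunk_size − overlap characters after the previous one, so chunk 2 covers characters 451–950.&lt;/p&gt;

```python
def split_chars(text, chunk_size=500, overlap=50):
    """Fixed-size character splitter with overlap between chunks."""
    step = chunk_size - overlap   # each chunk starts 450 chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_chars("a" * 1000)  # chunks start at characters 1, 451, 901
```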

&lt;p&gt;&lt;strong&gt;Subtopics / Advanced Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sliding Window Splitting → Moves through the text with a window (like overlap but continuous).&lt;/li&gt;
&lt;li&gt;Hierarchical Splitting → First split into sections, then split further if needed.&lt;/li&gt;
&lt;li&gt;Hybrid Splitters → Combine character + semantic strategies for optimal balance.&lt;/li&gt;
&lt;li&gt;Adaptive Splitting → Dynamically adjusts chunk size based on document complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Embedding
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is an Embedding in RAG?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Definition:&lt;/strong&gt;&lt;br&gt;
An embedding is a way to represent text (words, sentences, documents) as numbers in a multi-dimensional space, so that similar things are close together and different things are far apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Imagine a huge library with books in different languages. Instead of relying on book titles, we tag each book with a unique scent. Similar books (like two cooking books) have similar scents, so when you sniff one, you can easily find others nearby.&lt;br&gt;
In RAG, embeddings are that “scent” that helps us find the most relevant text chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are Embeddings Important in RAG?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG = Retrieval-Augmented Generation&lt;br&gt;
The workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;User asks a question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;System retrieves relevant knowledge from a large knowledge base.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LLM augments answer using this retrieved context.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without embeddings → retrieval would just be keyword search (Google-style).&lt;br&gt;
With embeddings → we capture semantic meaning (even if the exact words differ).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query: “What is the capital of France?”&lt;/li&gt;
&lt;li&gt;Keyword search → looks for exact word “capital” + “France”.&lt;/li&gt;
&lt;li&gt;Embedding search → knows “capital city” ≈ “capital”, and retrieves “Paris is the capital city of France.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How Embeddings are Used in RAG&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Convert documents into embeddings&lt;/strong&gt; (vector representations).&lt;/p&gt;

&lt;p&gt;· Each document/chunk is mapped into a vector space (e.g., 1536 dimensions for OpenAI embeddings).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Store embeddings in a Vector Database&lt;/strong&gt; (like Pinecone, Weaviate, FAISS, ChromaDB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Convert query into embedding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· User’s question is transformed into the same vector space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Retrieve nearest neighbors.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· Vector similarity (cosine similarity, dot product, etc.) finds closest documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Feed them into LLM for augmented generation.&lt;/strong&gt;&lt;/p&gt;
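
&lt;p&gt;Steps 3 and 4 in miniature: embed the query, score every chunk by cosine similarity, keep the closest. The tiny hand-made 3-dimensional vectors below stand in for real embeddings, which have hundreds or thousands of dimensions:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "embeddings" chosen by hand for illustration only
chunks = {
    "Paris is the capital city of France.":      [0.90, 0.10, 0.00],
    "Transformers help with medical diagnosis.": [0.10, 0.80, 0.20],
}
query_vec = [0.85, 0.15, 0.05]  # pretend embedding of "capital of France?"
best = max(chunks, key=lambda text: cosine(query_vec, chunks[text]))
```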

&lt;p&gt;&lt;strong&gt;Different Types of Embeddings in RAG&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depending on the use-case, embeddings can be:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Word embeddings&lt;/strong&gt; (old-school, e.g., Word2Vec, GloVe) → represent individual words.&lt;/p&gt;

&lt;p&gt;· Problem: “bank” (river bank vs. financial bank) gets one meaning only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Sentence/Document embeddings&lt;/strong&gt; (modern, e.g., OpenAI text-embedding-ada-002, Sentence-BERT) → represent whole sentences/documents, capturing context.&lt;/p&gt;

&lt;p&gt;· Better for retrieval since queries are often sentence-level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Multimodal embeddings&lt;/strong&gt; → represent text + images/audio in same space (e.g., CLIP).&lt;/p&gt;

&lt;p&gt;· Useful when your knowledge base includes PDFs with images, diagrams, or speech transcripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Concepts in Embeddings for RAG&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Vector Dimension:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· How many numbers represent a text (e.g., 512, 768, 1536, 4096).&lt;/p&gt;

&lt;p&gt;· Higher = more expressive but also heavier storage &amp;amp; compute.&lt;/p&gt;

&lt;p&gt;· Example: OpenAI text-embedding-3-small = 1536 dims, text-embedding-3-large = 3072 dims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Similarity Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· Cosine similarity → measures angle (most common).&lt;/p&gt;

&lt;p&gt;· Dot product → raw alignment.&lt;/p&gt;

&lt;p&gt;· Euclidean distance → physical distance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Chunking + Embedding:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· You don’t embed a whole 100-page PDF at once → too big &amp;amp; unhelpful.&lt;/p&gt;

&lt;p&gt;· Instead, split into chunks (Text Splitter), embed each chunk, and retrieve chunk-wise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Dynamic Embedding Updates:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;· In live systems, embeddings must be updated as the knowledge base grows (e.g., new wiki pages).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suppose we have a knowledge base of research papers.&lt;/li&gt;
&lt;li&gt;I ask: “What are applications of transformers in healthcare?”&lt;/li&gt;
&lt;li&gt;Embedding of my query is compared to embeddings of document chunks.&lt;/li&gt;
&lt;li&gt;Even if a chunk says “transformer models are widely applied in medical diagnosis”, embeddings ensure it’s retrieved.&lt;/li&gt;
&lt;li&gt;LLM then reads it and gives me a factually grounded answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without embeddings → system might miss it because it didn’t contain the exact keyword “healthcare”.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Embeddings are the backbone of retrieval in RAG. They transform text into dense vectors in high-dimensional space so semantically similar texts are close together. When a user query comes, we embed it, search for nearest vectors in a vector store, and feed those chunks into the LLM for answer generation. This allows retrieval to go beyond exact keyword matching and capture true meaning. For example, if I search for ‘doctor salary’, embeddings will still retrieve documents that mention ‘physician income’ because the semantic meaning is similar.”&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Vector Store
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Vector Store?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Vector Store is a special database that stores embeddings (numerical vectors) of your text (or images, audio, etc.) and lets you quickly search for the most similar items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Imagine a giant library, but instead of organizing books by title or author, it organizes them by meaning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You ask, “Tell me about renewable energy?”&lt;/li&gt;
&lt;li&gt;The librarian doesn’t just look for the word “renewable” → they check books with similar meaning (solar, wind, green power).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what a vector store does: semantic search instead of keyword search.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Do We Need Vector Stores?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs are stateless and can’t remember all documents.&lt;/p&gt;

&lt;p&gt;If you want an LLM to answer from your custom data (e.g., PDFs, databases), you need to:&lt;/p&gt;

&lt;p&gt;· Convert text → embeddings (numerical meaning).&lt;/p&gt;

&lt;p&gt;· Store embeddings in a vector store.&lt;/p&gt;

&lt;p&gt;· At query time → find nearest embeddings to your question.&lt;/p&gt;

&lt;p&gt;· Give those chunks to the LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Think of Google Search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional search = keyword match.&lt;/li&gt;
&lt;li&gt;Vector search = “understands what you mean” and finds the best matches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Core Workflow of Vector Stores&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Input Data (from loaders + splitters)&lt;/strong&gt;&lt;br&gt;
→ text chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Embedding Model&lt;/strong&gt;&lt;br&gt;
→ converts each chunk into a high-dimensional vector (like a unique fingerprint).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Vector Store&lt;/strong&gt;&lt;br&gt;
→ saves these embeddings in a searchable structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Similarity Search&lt;/strong&gt;&lt;br&gt;
→ when a query comes, it is embedded too → compare with stored embeddings → return top-k most similar chunks.&lt;/p&gt;
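
&lt;p&gt;The whole workflow fits in a short sketch of a flat (brute-force) index: score the query against every stored vector and keep the top-k. Real vector stores layer ANN indexing on top of this same idea; the class below is illustrative only:&lt;/p&gt;

```python
import heapq, math

class TinyVectorStore:
    """In-memory flat index: brute-force cosine search over all vectors."""
    def __init__(self):
        self.items = []                       # (vector, text) pairs

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query_vec, k=2):
        def score(item):
            vec, _ = item
            dot = sum(a * b for a, b in zip(query_vec, vec))
            return dot / (math.hypot(*query_vec) * math.hypot(*vec))
        return [text for _, text in heapq.nlargest(k, self.items, key=score)]

store = TinyVectorStore()
store.add([1.0, 0.0], "solar power basics")
store.add([0.9, 0.1], "wind energy overview")
store.add([0.0, 1.0], "tax law changes")
hits = store.search([1.0, 0.05], k=2)   # semantically closest two chunks
```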

&lt;p&gt;&lt;strong&gt;Key Features of Vector Stores&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similarity Search (find nearest embeddings).&lt;/li&gt;
&lt;li&gt;kNN (k-Nearest Neighbors) search → return top-k similar results.&lt;/li&gt;
&lt;li&gt;Cosine Similarity / Euclidean Distance → measures closeness.&lt;/li&gt;
&lt;li&gt;Filtering with Metadata → search by both meaning + filters (e.g., “all finance docs after 2020”).&lt;/li&gt;
&lt;li&gt;Hybrid Search → combine keyword + semantic search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Popular Vector Stores (Examples)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FAISS (Facebook AI) → Lightweight, open-source, runs locally.&lt;/li&gt;
&lt;li&gt;ChromaDB → Popular in LangChain, easy integration.&lt;/li&gt;
&lt;li&gt;Weaviate → Cloud-native, supports hybrid search.&lt;/li&gt;
&lt;li&gt;Pinecone → Fully managed, scalable SaaS solution.&lt;/li&gt;
&lt;li&gt;Milvus → Open-source, enterprise-level.&lt;/li&gt;
&lt;li&gt;Elasticsearch / OpenSearch → Traditional search engines, now support vectors too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Subtopics / Advanced Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Embedding Dimension&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each vector has, say, 384 / 768 / 1536 dimensions depending on the embedding model.&lt;/li&gt;
&lt;li&gt;Higher dimensions capture more nuance but are heavier to compute.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like a fingerprint scanner → more features captured = more accurate match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;b. Indexing Methods&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flat index → brute force (slow for large datasets).&lt;/li&gt;
&lt;li&gt;Approximate Nearest Neighbor (ANN) → faster search with small accuracy trade-off.&lt;/li&gt;
&lt;li&gt;HNSW (Hierarchical Navigable Small World Graph) → popular fast ANN algorithm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Flat = checking every student’s exam paper one by one.&lt;br&gt;
HNSW = grouping students by similarity first, then checking fewer papers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;c. Persistence vs In-Memory&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some stores keep vectors in memory (fast, but temporary).&lt;/li&gt;
&lt;li&gt;Others persist on disk / cloud (permanent storage).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;d. Filtering &amp;amp; Metadata&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store extra info (doc title, source, timestamp) alongside vectors.&lt;/li&gt;
&lt;li&gt;Enables queries like: “Give me legal documents about AI after 2021.”&lt;/li&gt;
&lt;/ul&gt;
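&lt;p&gt;A minimal sketch of metadata filtering (the store, field names, and vectors here are invented for illustration): filter by metadata first, then rank the survivors by similarity.&lt;/p&gt;

```python
import numpy as np

# Each entry: an embedding plus metadata (field names are made up for illustration)
store = [
    {"vec": np.array([0.9, 0.1]), "meta": {"topic": "legal", "year": 2022}},
    {"vec": np.array([0.85, 0.2]), "meta": {"topic": "legal", "year": 2019}},
    {"vec": np.array([0.1, 0.9]), "meta": {"topic": "finance", "year": 2023}},
]

def search(query_vec, topic=None, min_year=None, k=2):
    # 1. Apply metadata filters first ("all legal docs after 2021")
    candidates = [e for e in store
                  if (topic is None or e["meta"]["topic"] == topic)
                  and (min_year is None or e["meta"]["year"] > min_year)]
    # 2. Then rank the survivors by cosine similarity
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates.sort(key=lambda e: cos(query_vec, e["vec"]), reverse=True)
    return candidates[:k]

results = search(np.array([1.0, 0.0]), topic="legal", min_year=2021)
print([r["meta"] for r in results])  # only the 2022 legal doc survives the filter
```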

&lt;p&gt;&lt;strong&gt;e. Hybrid Search&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines keyword search (BM25, TF-IDF) with vector search.&lt;/li&gt;
&lt;li&gt;Useful when you want both semantic meaning + exact keyword match.&lt;/li&gt;
&lt;/ul&gt;
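&lt;p&gt;One common way to combine the two signals is a weighted blend. This is a sketch, not any library's actual formula; it assumes both scores are already normalized to [0, 1]:&lt;/p&gt;

```python
def hybrid_score(keyword_score: float, vector_score: float, alpha: float = 0.5) -> float:
    # alpha blends the two signals: 1.0 = pure keyword, 0.0 = pure semantic.
    # Both scores are assumed to be normalized to [0, 1] already.
    return alpha * keyword_score + (1 - alpha) * vector_score

# Doc A matches the exact keyword but is semantically weak;
# Doc B is semantically close but misses the exact term.
doc_a = hybrid_score(keyword_score=0.9, vector_score=0.3)
doc_b = hybrid_score(keyword_score=0.2, vector_score=0.95)
print(doc_a, doc_b)
```

With alpha = 0.5 both documents land close together, which is the point: hybrid search surfaces exact-match hits that pure semantic search would miss, and vice versa.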
&lt;h2&gt;Retriever&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a Retriever?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Retriever is a component that fetches the most relevant information from a knowledge base (like a vector store) when you ask a question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Imagine a librarian in a library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The vector store = the whole library.&lt;/li&gt;
&lt;li&gt;The retriever = the librarian who quickly finds the most relevant books or passages for your query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Do We Need Retrievers?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector stores can store millions of embeddings.&lt;/li&gt;
&lt;li&gt;You don’t want all documents → just the top-k most relevant ones.&lt;/li&gt;
&lt;li&gt;Retrievers sit between the database (vector store) and the model (LLM), ensuring only the best context is passed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Google search →&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database = the entire web.&lt;/li&gt;
&lt;li&gt;Retriever = the ranking engine that picks the top 10 results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workflow of a Retriever (Step-by-Step)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Input Query&lt;/strong&gt; → User asks a question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Embed the Query&lt;/strong&gt; → Convert it into a vector using the same embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Search in Vector Store&lt;/strong&gt; → Compare query vector with stored vectors (chunks).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Retrieve Top-k Results&lt;/strong&gt; → Return most similar documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Pass to LLM&lt;/strong&gt; → LLM uses them to generate the final answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analogy:&lt;/strong&gt;&lt;br&gt;
Like asking a librarian:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You → “Tell me about AI in healthcare.”&lt;/li&gt;
&lt;li&gt;Librarian → Finds the top 3 books/chapters.&lt;/li&gt;
&lt;li&gt;You → Read them and answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Types of Retrievers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can categorize retrievers in different ways depending on how they fetch information:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Based on Data Source:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How/where the retriever is pulling the information from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Vector Store Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses embeddings + vector similarity (e.g., cosine similarity, dot product).&lt;/li&gt;
&lt;li&gt;Examples: FAISS, Pinecone, Weaviate retrievers.&lt;/li&gt;
&lt;li&gt;Analogy: “You hum a song, and Shazam finds the closest match.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Keyword Retriever (Sparse Retriever)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses traditional text search (TF-IDF, BM25, Elasticsearch).&lt;/li&gt;
&lt;li&gt;Analogy: “You search for exact words in Google using quotes.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Hybrid Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines vector + keyword search for the best of both worlds.&lt;/li&gt;
&lt;li&gt;Analogy: “You tell the librarian: I want books that both contain the word ‘Einstein’ and cover topics similar to relativity.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Database/API Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves directly from structured databases or APIs (SQL, Mongo, REST calls).&lt;/li&gt;
&lt;li&gt;Analogy: “Instead of asking the librarian, you query the database directly: ‘Give me all employees where age &amp;gt; 40’.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Based on Retrieval Strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How the retriever decides what to return.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Similarity Search Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finds the top-k most similar chunks.&lt;/li&gt;
&lt;li&gt;Example: “Give me the top 5 passages most similar to this query.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. MMR Retriever (Maximal Marginal Relevance)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Balances relevance and diversity.&lt;/li&gt;
&lt;li&gt;Prevents redundancy (you don’t get the same idea phrased slightly differently).&lt;/li&gt;
&lt;li&gt;Analogy: Instead of giving you 5 books all about Einstein’s life, it gives you 1 about his life, 1 about relativity, 1 about his Nobel Prize, etc.&lt;/li&gt;
&lt;/ul&gt;
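&lt;p&gt;A minimal MMR sketch in plain NumPy (the vectors are toy values, and lam is the standard knob trading relevance against diversity): each round picks the candidate that is most relevant to the query but least similar to what was already picked.&lt;/p&gt;

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query, docs, k=2, lam=0.5):
    # lam balances relevance (to the query) vs diversity (from already-picked docs)
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos(query, docs[i])
            redundancy = max((cos(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
docs = [np.array([0.99, 0.1]),   # very relevant
        np.array([0.98, 0.12]),  # near-duplicate of the first
        np.array([0.6, 0.8])]    # less relevant, but different

# Lower lam favors diversity: doc 0 is picked first, then the
# diverse doc 2 beats the near-duplicate doc 1.
result = mmr(query, docs, k=2, lam=0.3)
print(result)  # [0, 2]
```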

&lt;p&gt;&lt;strong&gt;3. Self-Query Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses an LLM to rewrite your query into structured filters.&lt;/li&gt;
&lt;li&gt;Example: you ask “Show me Tesla articles after 2021”, and the LLM translates it into { company: Tesla, date &amp;gt; 2021 }.&lt;/li&gt;
&lt;li&gt;Analogy: “You ask the librarian in English, and they translate it into a library catalog search.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Contextual Compression Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves a lot, then compresses/summarizes before returning.&lt;/li&gt;
&lt;li&gt;Useful when documents are long.&lt;/li&gt;
&lt;li&gt;Analogy: “The librarian brings you 10 books, but highlights only the most useful paragraphs.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Multi-Vector Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores multiple embeddings per document (e.g., for the summary, keywords, and title).&lt;/li&gt;
&lt;li&gt;Helps retrieve by different “views” of the same doc.&lt;/li&gt;
&lt;li&gt;Analogy: Instead of indexing a movie only by title, you also index it by actors, genre, and plot summary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Parent Document Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Splits documents into small chunks for retrieval but returns the entire parent doc for context.&lt;/li&gt;
&lt;li&gt;Analogy: The librarian finds a single paragraph, then gives you the whole book it came from.&lt;/li&gt;
&lt;/ul&gt;
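&lt;p&gt;The chunk-to-parent mapping is the whole trick, and a sketch makes it concrete. The documents, chunk texts, and the word-overlap "matching" below are all made up for illustration; a real retriever would match chunks by vector similarity.&lt;/p&gt;

```python
# Index small chunks for precise matching, but hand back the whole parent doc.
parent_docs = {
    "doc1": "Full text of the employee handbook ...",
    "doc2": "Full text of the security policy ...",
}
# Each chunk remembers which parent it came from
chunks = [
    {"text": "remote work is allowed", "parent_id": "doc1"},
    {"text": "passwords rotate every 90 days", "parent_id": "doc2"},
]

def retrieve_parent(query: str) -> str:
    # Toy matching: pick the chunk sharing the most words with the query.
    # A real retriever would use vector similarity here.
    def overlap(chunk):
        return len(set(query.lower().split()) & set(chunk["text"].split()))
    best = max(chunks, key=overlap)
    return parent_docs[best["parent_id"]]  # return the parent, not the chunk

print(retrieve_parent("What is the remote work policy?"))
```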

&lt;p&gt;&lt;strong&gt;Based on Task/Use-Case:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specialized retrievers for certain workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Time/Recency-Based Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves the most recent docs first (important for news, finance).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Multi-Modal Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can retrieve not just text, but also images, audio, and code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Knowledge Graph Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves from graph structures (entities + relationships).&lt;/li&gt;
&lt;li&gt;Example: “Find me all drugs that interact with Aspirin.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Ensemble Retriever&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combines outputs from multiple retrievers (e.g., vector + BM25 + knowledge graph).&lt;/li&gt;
&lt;li&gt;Then merges the results (e.g., via rank fusion).&lt;/li&gt;
&lt;/ul&gt;
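&lt;p&gt;One well-known merge strategy is reciprocal rank fusion (RRF): each document earns 1/(k + rank) per ranked list, and the scores are summed. The doc IDs below are invented; k = 60 is the commonly used default.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of doc IDs per retriever.
    # Each doc scores 1/(k + rank) in each list; scores are summed across lists.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["d3", "d1", "d5"]   # from a vector retriever
bm25_results = ["d1", "d2", "d3"]     # from a keyword (BM25) retriever
fused = reciprocal_rank_fusion([vector_results, bm25_results])
print(fused)  # "d1" ranks first: it scored well in both lists
```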

&lt;p&gt;&lt;em&gt;“Retrievers can be categorized in multiple ways: by data source, by retrieval strategy, and by use case. The vector store retriever is the most common; MMR helps avoid redundancy, self-query enables metadata filtering, and the parent document retriever works best for long documents.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retriever in a RAG Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full pipeline flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Document Loader&lt;/strong&gt; → Reads data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Text Splitter&lt;/strong&gt; → Creates chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Embeddings&lt;/strong&gt; → Converts chunks to vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Vector Store&lt;/strong&gt; → Stores embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Retriever&lt;/strong&gt; → Fetches top-k relevant chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. LLM&lt;/strong&gt; → Generates the answer using the retrieved context.&lt;/p&gt;

&lt;p&gt;📄 Document Loader&lt;br&gt;
↓&lt;br&gt;
✂️ Text Splitter&lt;br&gt;
↓&lt;br&gt;
🔢 Embeddings → 🗄 Vector Store&lt;br&gt;
↓&lt;br&gt;
🔍 Retriever (Top-K results)&lt;br&gt;
↓&lt;br&gt;
🤖 LLM (Prompt + Context)&lt;br&gt;
↓&lt;br&gt;
📝 Final Answer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Code Example (with LangChain)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple RAG Example using LangChain
# Make sure you install: pip install langchain langchain-openai langchain-community langchain-text-splitters chromadb
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;

&lt;span class="c1"&gt;# 1️⃣ Load Documents
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_policies.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with your file
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2️⃣ Split Text into Chunks
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3️⃣ Create Embeddings + Store in Vector DB
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Requires OPENAI_API_KEY in env
&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4️⃣ Build Retriever
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# 5️⃣ Build RAG Chain
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_source_documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 6️⃣ Ask a Question
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s remote work policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG isn’t just a buzzword — it’s a powerful design pattern that makes LLMs practical for real-world applications. Understanding its components is the first step toward building production-ready AI assistants, chatbots, and knowledge systems.&lt;/p&gt;

&lt;p&gt;In future blogs, we’ll dive deeper into retrieval strategies, prompt engineering, and evaluation techniques for RAG systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
