DEV Community: Md Arsalan Arshad

When to Use an AI Agent and When Not To

Md Arsalan Arshad — Mon, 01 Jun 2026 19:07:49 +0000

Building an AI agent when you do not need one is like hiring a team of 10 people for a job one person can do.

Most engineers I see make this mistake in one of two ways.

Either they reach for an agent immediately, before they have even thought about whether the task needs one. Or they stick to chains for
everything because agents feel complex and unpredictable.

Both are wrong. And the cost of getting this wrong is real.

What is actually happening under the hood

An agent runs a minimum of 3 to 10 LLM calls per task. Often more. Each one costs tokens, adds latency, and introduces a point of failure. A chain runs one LLM call and moves on.

If your task has a fixed, predictable path, you just paid 10x for nothing. And you made your system harder to debug in production because now you have 10 things that could have gone wrong instead of one.

But if your task requires the system to look at what it just found and decide what to do next, a chain physically cannot do that. You have to hardcode every possible path upfront. Which means you are not building a chain. You are building a maze with no exits.

So the question is not "should I use an agent?" The question is one thing: does the next step depend on what the previous step returned?

If yes, you need an agent. The path cannot be defined upfront.

If no, use a chain. Same steps, same order, every time. Faster, cheaper, and fully deterministic.

A concrete example so this is not just theory

Say you are building a system that answers questions about Apple's Q3 earnings.

Step one: search for "Q3 2024 earnings Apple". Step two: fetch the top result. Step three: extract revenue numbers. Step four: format as a table.

Same path every run. Every single time. You can write this in 20 lines of Python without any agent infrastructure at all. This is a chain.

Now take a different task. You want the system to research Apple's recent financial performance, but you do not know upfront whether it will find a press release, a news article, or a data source. If it finds a news article, it needs to search for the primary source. If it finds data, it checks whether confidence is high enough or needs cross-referencing. If the numbers conflict across sources, it needs to reconcile them.

You cannot write that decision logic before you see what the search returns. The path through step 2, 3, and 4 depends entirely on what step 1 gave back. This is an agent.

The structure of the task determines which one you need. Not your preference, not what feels more impressive, not what you learned about last week.

The one question to ask before you build anything

Before you decide, run through this mentally:

Can I define the complete execution path right now, before the task runs?

If yes, write a chain. You know every step, every input, every output. There is no dynamic decision to make. An agent adds nothing here except cost and complexity.

Does the task need to look at intermediate results and decide what happens next?

If yes, you need an agent. The path is dynamic. It cannot be fully specified until the task is actually running and producing observations.

Does the task need to pick from multiple tools based on context?

If no, a chain with a fixed set of steps is the right call. Agents are designed for situations where the choice of tool itself is a decision that needs to be made at runtime.

What is the latency requirement?

Agents typically take 5 to 30 seconds for multi-step tasks. A single LLM call takes 0.5 to 2 seconds. If someone is waiting for a response in real time, this matters a lot.

What is the volume?

At 100,000 calls a day, an agent running 3x the LLM calls of a chain at potentially higher token counts can cost 9 times more. That difference is not theoretical. It shows up in your AWS bill at the end of the month.

How this played out in LocusLab

For LocusLab, this decision was clear from day one.

A customer sends an Instagram DM asking what hoodies are in stock. The system searches the Shopify catalog, finds the products, and returns a response. Same sequence every time. That is a chain.

A customer sends a DM asking about their order. The system checks the order status, gets back a result, and then has to decide: is this a delivery issue, a product issue, or does this need a human? Each of those paths triggers different tools and different responses. You cannot write that decision before you see what the order lookup returns. That is an agent.

Two tasks. Same product. Completely different infrastructure. Because the structure of the problem is different, not because one feels more advanced than the other.

Where most people go wrong

The mistake is not choosing agents over chains. The mistake is choosing based on anything other than the structure of the task.

Engineers reach for agents because they want to build something sophisticated. Or they avoid them because they had one go wrong in production and it spooked them. Both of these are the wrong reason.

The engineers who get this right are not the ones who know how to build agents. They are the ones who know when not to.

The short version

Use a chain when the path is fixed, the steps are predictable, and you can define everything upfront. It will be faster, cheaper, and easier to debug.

Use an agent when the next step genuinely depends on what the previous step returned, when the task needs to pick from multiple tools dynamically, or when self-correction is required because a bad intermediate result should change what happens next.

If you can hardcode it, hardcode it. Agents are not a better version of chains. They are a different tool for a different problem.

One of the best and detailed blog on How to decide which will be the best chunking strategy as per your document

Md Arsalan Arshad — Thu, 07 May 2026 08:22:44 +0000

Standard recursive splitters failing on code

Md Ayan Arshad

May 4

I Tested Chunking on Docs, PDFs, and Code. The Winner Changed Every Time.

#discuss #ai #programming #datascience

6 min read

Why I Separated the Indexing and Query Pipelines — And What Happened the One Time I Didn't

Md Arsalan Arshad — Thu, 02 Apr 2026 10:23:11 +0000

I was testing LocusLab, a multi-tenant AI agent platform I am building, and something was off with the latency. Not consistently off. Sometimes the agent would reply in under 2 seconds, sometimes it would take 9 or 10. No errors in the logs. No timeouts. Just this random unpredictable delay that made no sense.

My first instinct was the queue. Messages piling up maybe? I checked. Queue depth was fine. Then I thought it was the webhook, maybe the DM events were arriving late. Also fine. Then I thought the LLM calls were inconsistent so I started logging every stage of the pipeline individually. Still couldn't find it.

It took me 3 days to find the actual cause.

Both my indexing pipeline and query pipeline were running inside the same Lambda function. So when a user uploaded documents and indexing started, all the heavy work like preprocessing the documents, splitting them into chunks, generating embeddings, storing them in the vector database, it was all consuming the same compute and memory that the query side needed to reply to messages. The indexing would finish, resources would free up, and query latency would drop back to normal. Then someone uploads another document, indexing kicks off again, latency spikes. That is why it looked random. It was not random at all. It was completely tied to whenever indexing was happening in the background.

The moment I realised this I felt stupid. Because I knew this was the right architecture from the start and I still did not do it.

Why These Two Things Cannot Live Together

Before getting into what I changed, it helps to understand what these two pipelines actually are and why they are so different.

Think of it this way. The indexing pipeline is the librarian organising books in the background. Nobody is standing there waiting for each book to be placed on the shelf. It can take its time. The more it batches together, the more efficient it becomes. A document taking 2 minutes to fully index is completely fine because no user is waiting on the other side.

The query pipeline is the librarian answering a question from someone standing at the desk. That person is waiting right now. Every second feels long. You need to find the answer as fast as possible and get back to them.

When you put both of these in the same function, the librarian is trying to organise shelves and answer questions at the same time. The person at the desk keeps waiting because the librarian is busy in the back.

More technically, the indexing pipeline is optimised for throughput. You want to process as many documents as possible and batching embedding calls makes them cheaper. The query pipeline is optimized for latency. You want to get below 2 seconds end to end, run searches in parallel, check the cache first so you can skip the whole pipeline on repeated questions. These two goals fight each other when they share the same compute.

And the frustrating part is the failure is invisible. No errors, no crashes, just inconsistent latency that looks like 10 different problems before you find the real one.

What I Changed

I split them into two separate Lambda functions with an SQS queue connecting them.

The indexing pipeline is now its own Lambda that gets triggered by the SQS queue. When someone uploads documents we push a job to the queue and immediately tell the user their upload was received. The Lambda picks it up in the background and handles everything, figuring out what type of document it is, extracting the text, splitting it into chunks that make sense for retrieval, generating embeddings, storing everything in VectorDB. The user is not waiting for any of this. If it is slow it does not matter. If it fails the message stays in the queue and retries automatically. Failed jobs go to a dead letter queue so nothing silently disappears.

The query pipeline is a separate Lambda. When a message comes in it handles the full retrieval flow. It checks the cache first because a cache hit means you can respond in under 50ms without running any search at all. If it is a cache miss it runs vector search and keyword search at the same time in parallel rather than one after the other, then combines the results, picks the most relevant chunks, builds the context, calls the LLM, and returns the response. This function has no idea the indexing Lambda exists.

The two functions share only two things. The VectorDB index where vectors are stored and a DynamoDB table that tracks document and chunk metadata. That is the only connection between them.

What Actually Got Better

Query latency dropped and stayed consistent. I ran the same test that originally broke things, a large document upload triggering full indexing, while at the same time hitting the query side with multiple messages. Latency did not move.

Debugging also became much simpler and I did not expect this part.

Before the split every investigation started with figuring out what else was happening in the function at that exact moment. After the split that question became irrelevant. If a query is slow the problem is in the query Lambda. If a document fails to index the problem is in the indexing Lambda. They fail separately and it is obvious where to look.

The Honest Part

I knew this was the right architecture before I started building. Separate the pipelines, queue between them, indexing runs async in the background. I have read enough about system design to know this.
But I told myself just for now, just to move fast and see the agent working end to end, I will keep them together and fix it later. That was the plan.

Then I spent 3 days debugging a problem that should not have existed.

The agent quality was actually good during all of this. The retrieval was working, the responses were accurate, the Shopify integration was pulling the right products. All of that was fine. The only thing hurting the user experience was an architecture shortcut I took on day one that I knew was wrong.

That is the part that still bothers me. It was not a hard problem. It was a known problem I chose to defer.

Where I Am Now and What Comes Next

I am still on Lambda for both pipelines. At my current scale it works fine and Lambda is honestly a good fit for early stage products. It scales automatically, you only pay for what you use, and there is no infrastructure to manage.

The two real limitations I am aware of are cold starts and the 15 minute execution limit. Cold starts add latency when a function has not been called recently which matters a lot for the query side. The 15 minute limit means very large document processing jobs need to be broken into smaller pieces so they do not hit the ceiling.

When traffic grows to the point where I need a constantly warm query function and more control over how long indexing jobs can run, I will move to ECS. But that is a future problem. The separation itself is what mattered, not which compute service I used to do it.

When Keeping Them Together Is Actually Fine

If you are building a prototype, single tenant, small number of documents, no real users yet, keep them together. The overhead of managing two functions, a queue between them, and separate monitoring is not worth it when you are just trying to see if the product idea works at all.

The moment you have users uploading documents while other users are querying at the same time, separate them. That is the line. Not because of some rule, but because that is exactly when one pipeline starts silently hurting the other one.

Do not wait for 3 days of confused debugging to make the call.

RAG Components Explained: The Building Blocks of Modern AI

Md Arsalan Arshad — Wed, 25 Mar 2026 13:46:08 +0000

Retrieval-Augmented Generation (RAG) is one of the most powerful techniques for making Large Language Models (LLMs) smarter, more factual, and more up-to-date. Instead of relying only on what an LLM was trained on, RAG retrieves relevant external information first and then asks the model to generate an answer based on that information.

In this blog, we’ll break down the core components of RAG, step by step, in a simple and practical way. By the end, you’ll have a clear mental model of how RAG works and why each component matters.

RAG is not a single model — it’s a pipeline of steps. Here are the detailed building blocks:

Document Loader

What is a Document Loader?

A Document Loader is a component that reads data from files or sources and converts it into a format your AI model can understand and process.

Why Document Loaders are Important?

Most AI models, especially LLMs, only understand text, not raw PDFs, Excel files, images, or websites.
Document loaders standardize and normalize the input.
They make your data searchable, chunkable, and usable for embeddings or question-answering pipelines.

Analogy:
Imagine asking someone to summarize a book. If the book is in a messy stack of scanned pages, it’s impossible. But if all pages are typed out and cleaned, it’s easy. Document loaders do this “cleaning and typing out” automatically.

Types of Document Loaders

Document loaders can be classified by source type:

a. Local File Loaders

Examples: TextLoader, PDFLoader, CSVLoader, DocxLoader
Reads files from your computer or server.
Handles different file formats and converts them to plain text.

b. Web Loaders

Examples: WebBaseLoader, SitemapLoader
Fetch data from web pages, APIs, or RSS feeds.
Often includes cleaning HTML tags, scripts, or ads before processing.

c. Cloud Storage Loaders

Examples: GoogleDriveLoader, S3Loader, NotionLoader
Load documents from cloud services (Google Drive, AWS S3, Notion, Confluence).
Often requires authentication keys or APIs.

d. Database Loaders

Examples: SQLDatabaseLoader, MongoDBLoader
Load data from structured databases (tables, queries).
Converts database rows into textual documents for NLP models.

Key Features of Document Loaders

Parsing — Understand the structure of the source (PDF pages, CSV rows).
Cleaning — Remove noise like HTML tags, whitespace, or irrelevant metadata.
Splitting/Chunking — Large documents are broken into smaller, model-friendly chunks.
Metadata Extraction — Keep info like file name, page number, or URL for traceability.

Subtopics / Advanced Concepts

a. Recursive Document Loading

Breaks nested structures (PDF with multiple sections, Word with headings).
Useful when you need fine-grained context.

Analogy:
Like peeling an onion layer by layer to get every bit of flavor.

b. Combining Multiple Loaders

LangChain allows combining loaders to handle multiple formats in one pipeline. Example: Load PDFs + CSVs + Webpages together for a knowledge base.

Analogy:
Blending apples, bananas, and strawberries together for a multi-fruit smoothie.

c. Custom Loaders

You can write a custom loader if your data source is unique (e.g., a proprietary ERP system). Key methods: load() returns a list of Document objects.

Analogy:
Designing your own custom blender attachment for a special fruit.

Document Object Structure (LangChain)

Every loader returns a Document object with:

page content → The text content.
metadata → Dictionary with extra info (file name, source, page number).

You can say:

“Document loaders normalize and convert raw sources into Document objects with text and metadata, enabling LLMs to process heterogeneous sources efficiently.”

Text Splitter

What is a Text Splitter?

A Text Splitter is a tool that breaks a large document into smaller, manageable pieces (chunks) so that a language model can process them efficiently.

Analogy:
Imagine you have a giant pizza . Your mouth (LLM) can’t eat the whole pizza at once, so you cut it into slices. Each slice = one chunk of text.

Why Do We Need Text Splitters?

LLMs have a context window limit (e.g., GPT-4 can handle ~128k tokens max). If your document is too big, it won’t fit.
Splitting ensures:
The model doesn’t forget earlier parts.
You can store chunks in a vector database for retrieval.
Queries can be answered with relevant small pieces, not the entire document.

Analogy:
Think of studying a textbook. Instead of memorizing the whole 500 pages at once, you break it into chapters and sections.

Types of Text Splitters

There are multiple strategies for splitting text. Let’s go through them one by one:

a. Character/Text Splitter

Splits text by character count.
Example: Split every 1000 characters.
Simple but may cut sentences awkwardly.

Analogy:
Like chopping a book into pieces of 10 pages each, without caring if a chapter gets cut mid-way.

b. Recursive Character/Text Splitter

Smarter version → tries to split by logical boundaries first (paragraphs → sentences → words → characters).
Ensures chunks are readable and meaningful.

Analogy:
Like first cutting a cake by layers, then into slices, and finally into bites, if needed.

c. Token Splitter

Splits based on tokens (subword units) instead of characters.
Useful because LLMs process tokens, not raw characters.
Prevents exceeding model’s token limit.

Analogy:
Like breaking a speech into words, not random letters.

d. Semantic/Embedding-Based Splitter

Uses embeddings to split text into semantically coherent chunks.
Ensures each chunk makes sense contextually, not just by size.

Analogy:
Like splitting a documentary into topics rather than every 10 minutes.

e. Markdown / Code-Aware Splitter

Specialized splitters for Markdown docs, programming code, or XML.
Preserves structure (e.g., headers, code blocks).

Analogy:
Like cutting a programming tutorial while keeping each function intact instead of cutting mid-function.

Important Parameters of Text Splitters

1. Chunk Size

· Maximum length of a chunk (e.g., 500 tokens).

· Too small = loses context. Too large = may exceed model limits.

2. Chunk Overlap

· Extra tokens carried over between chunks.

· Prevents context loss at boundaries.

· Example: If chunk size = 500 and overlap = 50 → chunk 1 = 1–500, chunk 2 = 451–950.

Subtopics / Advanced Concepts

Sliding Window Splitting → Moves through the text with a window (like overlap but continuous).
Hierarchical Splitting → First split into sections, then split further if needed.
Hybrid Splitters → Combine character + semantic strategies for optimal balance.
Adaptive Splitting → Dynamically adjusts chunk size based on document complexity.

Embedding

What is an Embedding in RAG?

Simple Definition:
An embedding is a way to represent text (words, sentences, documents) as numbers in a multi-dimensional space, so that similar things are close together and different things are far apart.

Analogy:
Imagine a huge library with books in different languages. Instead of relying on book titles, we tag each book with a unique scent. Similar books (like two cooking books) have similar scents, so when you sniff one, you can easily find others nearby.
In RAG, embeddings are that “scent” that helps us find the most relevant text chunks.

Why are Embeddings Important in RAG?

RAG = Retrieval-Augmented Generation
The workflow is:

User asks a question.
System retrieves relevant knowledge from a large knowledge base.
LLM augments answer using this retrieved context.

Without embeddings → retrieval would just be keyword search (Google-style).
With embeddings → we capture semantic meaning (even if the exact words differ).

Example:

Query: “What is the capital of France?”
Keyword search → looks for exact word “capital” + “France”.
Embedding search → knows “capital city” ≈ “capital”, and retrieves “Paris is the capital city of France.”

How Embeddings are Used in RAG

1. Convert documents into embeddings (vector representations).

· Each document/chunk is mapped into a vector space (e.g., 1536 dimensions for OpenAI embeddings).

2. Store embeddings in a Vector Database (like Pinecone, Weaviate, FAISS, ChromaDB).

3. Convert query into embedding.

· User’s question is transformed into the same vector space.

4. Retrieve nearest neighbors.

· Vector similarity (cosine similarity, dot product, etc.) finds closest documents.

5. Feed them into LLM for augmented generation.

Different Types of Embeddings in RAG

Depending on the use-case, embeddings can be:

1. Word embeddings (old-school, e.g., Word2Vec, GloVe) → represent individual words.

· Problem: “bank” (river bank vs. financial bank) gets one meaning only.

2. Sentence/Document embeddings (modern, e.g., OpenAI Ada-002, Sentence-BERT) → represent whole sentences/documents, capturing context.

· Better for retrieval since queries are often sentence-level.

3. Multimodal embeddings → represent text + images/audio in same space (e.g., CLIP).

· Useful when your knowledge base includes PDFs with images, diagrams, or speech transcripts.

Key Concepts in Embeddings for RAG

1. Vector Dimension:

· How many numbers represent a text (e.g., 512, 768, 1536, 4096).

· Higher = more expressive but also heavier storage & compute.

· Example: OpenAI text-embedding-3-small = 1536 dims, text-embedding-3-large = 3072 dims.

2. Similarity Metrics:

· Cosine similarity → measures angle (most common).

· Dot product → raw alignment.

· Euclidean distance → physical distance.

3. Chunking + Embedding:

· You don’t embed a whole 100-page PDF at once → too big & unhelpful.

· Instead, split into chunks (Text Splitter), embed each chunk, and retrieve chunk-wise.

4. Dynamic Embedding Updates:

· In live systems, embeddings must be updated as knowledge base grows (e.g., new wiki pages).

Example

Scenario:

Suppose we have a knowledge base of research papers.
I ask: “What are applications of transformers in healthcare?”
Embedding of my query is compared to embeddings of document chunks.
Even if a chunk says “transformer models are widely applied in medical diagnosis”, embeddings ensure it’s retrieved.
LLM then reads it and gives me a factually grounded answer.

Without embeddings → system might miss it because it didn’t contain the exact keyword “healthcare”.

“Embeddings are the backbone of retrieval in RAG. They transform text into dense vectors in high-dimensional space so semantically similar texts are close together. When a user query comes, we embed it, search for nearest vectors in a vector store, and feed those chunks into the LLM for answer generation. This allows retrieval to go beyond exact keyword matching and capture true meaning. For example, if I search for ‘doctor salary’, embeddings will still retrieve documents that mention ‘physician income’ because the semantic meaning is similar.”

Vector Store

What is a Vector Store?

A Vector Store is a special database that stores embeddings (numerical vectors) of your text (or images, audio, etc.) and lets you quickly search for the most similar items.

Analogy:
Imagine a giant library, but instead of organizing books by title or author, it organizes them by meaning.

You ask, “Tell me about renewable energy?”
The librarian doesn’t just look for the word “renewable” → they check books with similar meaning (solar, wind, green power).

That’s what a vector store does: semantic search instead of keyword search.

Why Do We Need Vector Stores?

LLMs are stateless and can’t remember all documents.

If you want an LLM to answer from your custom data (e.g., PDFs, databases), you need:

· Convert text → embeddings (numerical meaning).

· Store embeddings in a vector store.

· At query time → find nearest embeddings to your question.

· Give those chunks to the LLM.

Analogy:
Think of Google Search:

Traditional search = keyword match.
Vector search = “understands what you mean” and finds the best matches.

Core Workflow of Vector Stores

1. Input Data (from loaders + splitters)
→ text chunks.

2. Embedding Model
→ converts each chunk into a high-dimensional vector (like a unique fingerprint).

3. Vector Store
→ saves these embeddings in a searchable structure.

4. Similarity Search
→ when a query comes, it is embedded too → compare with stored embeddings → return top-k most similar chunks.

Key Features of Vector Stores

Similarity Search (find nearest embeddings).
kNN (k-Nearest Neighbors) search → return top-k similar results.
Cosine Similarity / Euclidean Distance → measures closeness.
Filtering with Metadata → search by both meaning + filters (e.g., “all finance docs after 2020”).
Hybrid Search → combine keyword + semantic search.

Popular Vector Stores (Examples)

FAISS (Facebook AI) → Lightweight, open-source, runs locally.
ChromaDB → Popular in LangChain, easy integration.
Weaviate → Cloud-native, supports hybrid search.
Pinecone → Fully managed, scalable SaaS solution.
Milvus → Open-source, enterprise-level.
Elasticsearch / OpenSearch → Traditional search engines, now support vectors too.

Subtopics / Advanced Concepts

a. Embedding Dimension

Each vector has, say, 384 / 768 / 1536 dimensions depending on the embedding model.
Higher dimensions capture more nuance but are heavier to compute.

Analogy:
Like a fingerprint scanner → more features captured = more accurate match.

b. Indexing Methods

Flat index → brute force (slow for large datasets).
Approximate Nearest Neighbor (ANN) → faster search with small accuracy trade-off.
HNSW (Hierarchical Navigable Small World Graph) → popular fast ANN algorithm.

Analogy:
Flat = checking every student’s exam paper one by one.
HNSW = grouping students by similarity first, then checking fewer papers.

c. Persistence vs In-Memory

Some stores keep vectors in memory (fast, but temporary).
Others persist on disk / cloud (permanent storage).

d. Filtering & Metadata

Store extra info (doc title, source, timestamp) alongside vectors.
Enables queries like: “Give me legal documents about AI after 2021.”

e. Hybrid Search

Combines keyword search (BM25, TF-IDF) with vector search.
Useful when you want both semantic meaning + exact keyword match.

Retriever

What is a Retriever?

A Retriever is a component that fetches the most relevant information from a knowledge base (like a vector store) when you ask a question.

Analogy:
Imagine a librarian in a library:

The vector store = the whole library.
The retriever = the librarian who quickly finds the most relevant books or passages for your query.

Why Do We Need Retrievers?

Vector stores can store millions of embeddings.
You don’t want all documents → just the top-k most relevant ones.
Retrievers sit between the database (vector store) and the model (LLM), ensuring only the best context is passed.

Analogy:
Google search →

Database = the entire web.
Retriever = the ranking engine that picks the top 10 results.

Workflow of a Retriever (Step-by-Step)

1. Input Query → User asks a question.

2. Embed the Query → Convert it into a vector using the same embedding model.

3. Search in Vector Store → Compare query vector with stored vectors (chunks).

4. Retrieve Top-k Results → Return most similar documents.

5. Pass to LLM → LLM uses them to generate the final answer.

Analogy:
Like asking a librarian:

You → “Tell me about AI in healthcare.”
Librarian → Finds the top 3 books/chapters.
You → Read them and answer.

Types of Retrievers

We can categorize retrievers in different ways depending on how they fetch information:

Based on Data Source:

How/where the retriever is pulling the information from.

1. Vector Store Retriever

· Uses embeddings + vector similarity (e.g., cosine similarity, dot product).

· Example: FAISS, Pinecone, Weaviate retrievers.

· Analogy: “You hum a song, and Shazam finds the closest match.”

2. Keyword Retriever (Sparse Retriever)

· Uses traditional text search (TF-IDF, BM25, ElasticSearch).

· Analogy: “You search exact words in Google using quotes.”

3. Hybrid Retriever

· Combines Vector + Keyword for best of both worlds.

· Analogy: “You tell librarian: I want books that both contain the word ‘Einstein’ and sound similar to relativity topics.”

4. Database/API Retriever

· Retrieves directly from structured databases or APIs (SQL, Mongo, REST calls).

· Analogy: “Instead of asking the librarian, you ask the database directly: ‘Give me all employees where age > 40’.”

Based on Retrieval Strategy:

How the retriever decides what to return.

1. Similarity Search Retriever

· Finds top-k most similar chunks.

· Example: “Give me top 5 most similar passages to this query.”

2. MMR Retriever (Maximal Marginal Relevance)

· Balances relevance and diversity.

· Prevents redundancy (you don’t get the same idea phrased slightly differently).

· Analogy: Instead of giving you 5 books all about Einstein’s life, it gives you 1 about his life, 1 about relativity, 1 about Nobel prize, etc.

3. Self-Query Retriever

· Uses an LLM to rewrite your query into structured filters.

· Example: You ask: “Show me Tesla articles after 2021.”

§ LLM translates → { company: Tesla, date > 2021 }.

· Analogy: “You ask the librarian in English, they translate it into a library catalog search.”

4. Contextual Compression Retriever

· Retrieves a lot, then compresses/summarizes before returning.

· Useful when documents are long.

· Analogy: “Librarian brings you 10 books, but highlights only the most useful paragraphs.”

5. Multi-Vector Retriever

· Stores multiple embeddings per document (e.g., for summary, keywords, title).

· Helps retrieve by different “views” of the same doc.

· Analogy: Instead of indexing a movie only by title, you also index by actors, genre, plot summary.

6. Parent Document Retriever

· Splits into small chunks for retrieval but returns the entire parent doc for context.

· Analogy: Librarian finds a single paragraph, then gives you the whole book it came from.

Based on Task/Use-Case:

Specialized retrievers for certain workflows.

1. Time/Recency-Based Retriever

· Retrieves most recent docs first (important for news, finance).

2. Multi-Modal Retriever

· Can retrieve not just text, but also images, audio, code.

3. Knowledge Graph Retriever

· Retrieves from graph structures (entities + relationships).

· Example: “Find me all drugs that interact with Aspirin.”

4. Ensemble Retriever

· Combines outputs from multiple retrievers (e.g., vector + BM25 + KG).

· Then merges results (e.g., rank fusion).

“Retrievers can be categorized in multiple ways: by source, by retrieval strategy, and by use case. For example, the most common is vector store retriever, but for avoiding redundancy we use MMR. For filtering, we can use self-query. For large docs, parent-document retriever works best.”

Retriever in a RAG Pipeline

Full pipeline flow:

1. Document Loader → Reads data.

2. Text Splitter → Creates chunks.

3. Embeddings → Converts chunks to vectors.

4. Vector Store → Stores embeddings.

5. Retriever → Fetches top-k relevant chunks.

6.LLM → Generates answer using retrieved context.

📄 Document Loader
↓
✂️ Text Splitter
↓
🔢 Embeddings → 🗄 Vector Store
↓
🔍 Retriever (Top-K results)
↓
🤖 LLM (Prompt + Context)
↓
📝 Final Answer

Simple Code Example (with LangChain)

# Simple RAG Example using LangChain
# Make sure you install: pip install langchain openai chromadb

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1️⃣ Load Documents
loader = TextLoader("company_policies.txt")  # Replace with your file
documents = loader.load()

# 2️⃣ Split Text into Chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# 3️⃣ Create Embeddings + Store in Vector DB
embeddings = OpenAIEmbeddings()  # Requires OPENAI_API_KEY in env
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4️⃣ Build Retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 5️⃣ Build RAG Chain
llm = ChatOpenAI(model="gpt-4")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# 6️⃣ Ask a Question
query = "What is our company's remote work policy?"
result = qa_chain.invoke({"query": query})

print("Answer:", result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])

Conclusion

RAG isn’t just a buzzword — it’s a powerful design pattern that makes LLMs practical for real-world applications. Understanding its components is the first step toward building production-ready AI assistants, chatbots, and knowledge systems.

In future blogs, we’ll dive deeper into retrieval strategies, prompt engineering, and evaluation techniques for RAG systems.