<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pinaki Batabyal</title>
    <description>The latest articles on DEV Community by Pinaki Batabyal (@logout007).</description>
    <link>https://dev.to/logout007</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935062%2F018b6eb5-8e42-435e-8886-4a4937249961.jpeg</url>
      <title>DEV Community: Pinaki Batabyal</title>
      <link>https://dev.to/logout007</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/logout007"/>
    <language>en</language>
    <item>
      <title>I Built a RAG Pipeline From Scratch and It Completely Changed How I Think About AI</title>
      <dc:creator>Pinaki Batabyal</dc:creator>
      <pubDate>Sat, 16 May 2026 15:34:24 +0000</pubDate>
      <link>https://dev.to/logout007/i-built-a-rag-pipeline-from-scratch-and-it-completely-changed-how-i-think-about-ai-3hb8</link>
      <guid>https://dev.to/logout007/i-built-a-rag-pipeline-from-scratch-and-it-completely-changed-how-i-think-about-ai-3hb8</guid>
      <description>&lt;h1&gt;
  
  
  I Built a RAG Pipeline From Scratch and It Completely Changed How I Think About AI
&lt;/h1&gt;

&lt;p&gt;I've been writing code for 3+ years. I thought I understood how AI worked.&lt;/p&gt;

&lt;p&gt;I didn't.&lt;/p&gt;

&lt;p&gt;Not until I sat down one weekend, opened a blank Node.js project, and decided to build something I'd been curious about for months — a system that could read a stack of PDFs and actually &lt;em&gt;answer questions&lt;/em&gt; about them. In plain English. With sources.&lt;/p&gt;

&lt;p&gt;What followed was honestly one of the most satisfying weeks of building I've ever had.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Started
&lt;/h2&gt;

&lt;p&gt;I'd been using ChatGPT like everyone else — pasting text, asking questions, getting answers. But I kept hitting the same wall: &lt;em&gt;it didn't know my documents&lt;/em&gt;. It couldn't read a specific PDF I had. It couldn't search across multiple files. It couldn't say "this answer is on page 12."&lt;/p&gt;

&lt;p&gt;I knew RAG (Retrieval-Augmented Generation) was the solution. I'd read about it. I understood it conceptually.&lt;/p&gt;

&lt;p&gt;Actually building it is a completely different thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment It Clicked
&lt;/h2&gt;

&lt;p&gt;The first time I uploaded a PDF, typed a question, and got back a precise answer — with the exact page number — I genuinely sat back and stared at the screen for a few seconds.&lt;/p&gt;

&lt;p&gt;Not because it was magic. But because I &lt;em&gt;understood&lt;/em&gt; every single step that produced that answer. I wrote every line. I knew why it worked.&lt;/p&gt;

&lt;p&gt;That feeling is hard to describe. It's different from using a library or calling an API. This was mine, end to end.&lt;/p&gt;

&lt;p&gt;Here's the architecture I landed on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User types a question
    ↓
Embed the question (OpenAI text-embedding-3-small)
    ↓
Find the most similar chunks in the database (pgvector)
    ↓
Feed those chunks into GPT-4o-mini
    ↓
Get a precise, grounded answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four steps. Deceptively simple on paper. Deeply interesting to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building Each Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chunking is harder than it sounds
&lt;/h3&gt;

&lt;p&gt;My first attempt: split text every 500 characters. Done.&lt;/p&gt;

&lt;p&gt;The results were awful. Sentences got cut in half. Context got destroyed. The model would retrieve a chunk that started mid-sentence and couldn't make sense of it.&lt;/p&gt;

&lt;p&gt;The fix was breaking on sentence boundaries with overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chunkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Don't cut mid-sentence — find the nearest period&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;breakPoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lastIndexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;breakPoint&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;chunkSize&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;breakPoint&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// overlap = no lost context at boundaries&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 50-character overlap sounds tiny. It makes a huge difference.&lt;/p&gt;




&lt;h3&gt;
  
  
  Embeddings feel like magic until you understand them
&lt;/h3&gt;

&lt;p&gt;An embedding is just a list of 1536 numbers that represents the &lt;em&gt;meaning&lt;/em&gt; of a piece of text. Two sentences that mean similar things will produce similar number lists — even if they use completely different words.&lt;/p&gt;

&lt;p&gt;So "What are the safety requirements?" and "List the security rules" will have similar embeddings, even though they share no words. That's semantic search. That's what makes this better than ctrl+F.&lt;/p&gt;

&lt;p&gt;I chose &lt;code&gt;text-embedding-3-small&lt;/code&gt; over the older &lt;code&gt;ada-002&lt;/code&gt;. 80% cheaper, equal or better quality. Easy choice.&lt;/p&gt;

&lt;p&gt;Batch embedding 400 chunks and watching them all get stored in the database in about 8 seconds was one of those quiet "oh wow" moments.&lt;/p&gt;
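&lt;p&gt;The batching itself is simple, because the embeddings endpoint accepts an array of inputs per request — you mostly just slice. A sketch (the batch size of 100 is my choice, not an API requirement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Split chunks into batches; each batch becomes one embeddings request
function toBatches(items, size = 100) {
  const batches = [];
  for (let i = 0; i &lt; items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Each batch then goes to OpenAI in a single call, roughly:
// const res = await openai.embeddings.create({
//   model: 'text-embedding-3-small',
//   input: batch,                      // array of strings
// });
// const vectors = res.data.map(d =&gt; d.embedding);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;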




&lt;h3&gt;
  
  
  pgvector is genuinely impressive
&lt;/h3&gt;

&lt;p&gt;I expected to need a dedicated vector database — Pinecone, Weaviate, Qdrant. I'd heard of all of them.&lt;/p&gt;

&lt;p&gt;Then I discovered &lt;code&gt;pgvector&lt;/code&gt; — a Postgres extension that adds a vector column type and similarity-search operators. I already know SQL. I already use Supabase. Setup was a few lines of SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;extension&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;document_chunks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt;    &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;page_number&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;embedding&lt;/span&gt;  &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;index&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;document_chunks&lt;/span&gt;
  &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ivfflat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And querying it is just SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;document_chunks&lt;/span&gt;
&lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; is cosine distance. &lt;code&gt;1 - distance = similarity&lt;/code&gt;. I love how clean this is.&lt;/p&gt;
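&lt;p&gt;Calling this from Node has one wrinkle: pgvector wants the embedding parameter serialised as a vector literal string. A sketch, assuming node-postgres (the helper name is mine, not a pgvector API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// pgvector parses a string like '[0.12,-0.05,...]' as a vector
function toVectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// With node-postgres, the query above becomes roughly:
// const { rows } = await pool.query(
//   `select content, page_number,
//           1 - (embedding &lt;=&gt; $1) as similarity
//    from document_chunks
//    order by embedding &lt;=&gt; $1
//    limit 5`,
//   [toVectorLiteral(queryEmbedding)]
// );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;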




&lt;h3&gt;
  
  
  The system prompt matters more than the model
&lt;/h3&gt;

&lt;p&gt;I spent more time on the system prompt than on any other single piece of code. The difference between a well-prompted model and a poorly-prompted one is dramatic.&lt;/p&gt;

&lt;p&gt;My first prompt: "Answer questions using the provided context."&lt;/p&gt;

&lt;p&gt;Results: confidently wrong answers, hallucinated details, vague summaries.&lt;/p&gt;

&lt;p&gt;My final prompt, after many iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a precise document assistant. Answer questions using
ONLY the provided context chunks.

- If the answer is in the context, answer clearly.
- If it isn't, say exactly: "I could not find a clear answer
  in the uploaded documents."
- Never make up information not present in the context.
- Be concise. Prefer bullet points for multi-part answers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single word "ONLY" and the explicit fallback phrase cut hallucinations significantly. The model still reasons and synthesises — it just stays grounded.&lt;/p&gt;

&lt;p&gt;Temperature = 0.1, by the way. This isn't a creative writing task.&lt;/p&gt;
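&lt;p&gt;Putting prompt and retrieval together: the retrieved chunks get labelled with their page numbers and packed into a single user message. A sketch — &lt;code&gt;buildMessages&lt;/code&gt; and &lt;code&gt;SYSTEM_PROMPT&lt;/code&gt; are my names, and the chat call follows OpenAI's Node SDK shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Pack retrieved chunks (with page numbers) into one grounded prompt
function buildMessages(systemPrompt, chunks, question) {
  const context = chunks
    .map(c =&gt; `[page ${c.page_number}]\n${c.content}`)
    .join('\n\n---\n\n');
  return [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}

// const res = await openai.chat.completions.create({
//   model: 'gpt-4o-mini',
//   temperature: 0.1,   // precision, not creativity
//   messages: buildMessages(SYSTEM_PROMPT, topChunks, question),
// });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;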




&lt;h3&gt;
  
  
  MCP was the rabbit hole I didn't expect
&lt;/h3&gt;

&lt;p&gt;Halfway through the project I read about the Model Context Protocol (MCP) — a way to give LLMs structured tools they can invoke as function calls: search a database, query an API, fetch live data.&lt;/p&gt;

&lt;p&gt;I added two tools to my pipeline. Here's the document search tool, declared in OpenAI's function-calling schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;function&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;search_documents&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Search uploaded PDFs for relevant information&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model decides &lt;em&gt;when&lt;/em&gt; to search. You can ask a multi-part question and it'll call the search tool, synthesise the results, and respond — all in one turn. No hardcoded routing logic. The model figures it out.&lt;/p&gt;
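&lt;p&gt;The plumbing for that turn is small: when a response contains &lt;code&gt;tool_calls&lt;/code&gt;, each one is executed and its result is sent back as a &lt;code&gt;tool&lt;/code&gt; message before asking the model again. A sketch — the handler wiring is mine, while the message shapes follow OpenAI's chat completions API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Execute the model's tool calls and build the `tool` replies it expects
async function runToolCalls(toolCalls, handlers) {
  const replies = [];
  for (const call of toolCalls) {
    const args = JSON.parse(call.function.arguments);
    const result = await handlers[call.function.name](args);
    replies.push({
      role: 'tool',
      tool_call_id: call.id,
      content: JSON.stringify(result),
    });
  }
  return replies;
}

// In the loop: push the assistant message, push these replies,
// then call chat.completions.create again for the final answer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;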

&lt;p&gt;This is when I started to understand why everyone in AI engineering talks about agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers (Real, Measured)
&lt;/h2&gt;

&lt;p&gt;I ran 50 test queries across a set of PDFs I had lying around.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P95 response time&lt;/td&gt;
&lt;td&gt;2.8 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average response time&lt;/td&gt;
&lt;td&gt;1.9 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding cost for ~200 PDFs&lt;/td&gt;
&lt;td&gt;~$0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queries that returned correct page citation&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queries where I'd say the answer was "good"&lt;/td&gt;
&lt;td&gt;~82%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 18% miss rate is real and honest. It's mostly on questions that require synthesising information across many pages — a known weakness of basic RAG. Hybrid search (combining vector search with BM25 keyword search) would improve this. That's my next experiment.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Tell Myself Before Starting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with fewer PDFs than you think.&lt;/strong&gt; I tried to test with 50 documents at once, which was a mistake. Debug with 3 instead. You'll thank yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The similarity threshold matters.&lt;/strong&gt; I filter out chunks below 30% cosine similarity before passing them to the LLM. Without this filter, irrelevant chunks confuse the model and produce vague, wishy-washy answers.&lt;/p&gt;
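&lt;p&gt;The filter itself is one line, but it earns its keep. A sketch (0.3 is the threshold I settled on empirically, not a universal constant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Drop weak matches before they ever reach the prompt
function filterRelevant(rows, minSimilarity = 0.3) {
  return rows.filter(r =&gt; r.similarity &gt;= minSimilarity);
}

// If nothing survives, skip the LLM call entirely and return the
// fallback answer directly — cheaper, faster, and more honest.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;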

&lt;p&gt;&lt;strong&gt;pdf-parse is good but imperfect.&lt;/strong&gt; Scanned PDFs (images of text) return nothing — you need OCR for those. Text PDFs work great. Know your document types before you commit to a parsing strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do per-page extraction from the start.&lt;/strong&gt; I approximated page numbers instead. It works, but it isn't accurate. Use pdf.js if exact page attribution matters for your use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Project Was Different
&lt;/h2&gt;

&lt;p&gt;I build things constantly. APIs, dashboards, mobile apps. Most of it is satisfying in a normal way.&lt;/p&gt;

&lt;p&gt;This one was different because every piece builds on the previous one in a way that feels like a proper system — not just features bolted together. The chunker feeds the embedder. The embedder feeds the vector store. The vector store feeds the retriever. The retriever feeds the synthesiser. Change one and it ripples through everything.&lt;/p&gt;

&lt;p&gt;And the output is &lt;em&gt;intelligent&lt;/em&gt;. It reads documents and &lt;em&gt;understands&lt;/em&gt; them. I wrote the code that makes that happen, and I still find it a little bit remarkable every time I run it.&lt;/p&gt;

&lt;p&gt;If you've been curious about RAG but haven't started — start. The gap between "I understand the concept" and "I built it" is where all the real learning happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; Node.js + Express&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF parsing:&lt;/strong&gt; pdf-parse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings:&lt;/strong&gt; OpenAI text-embedding-3-small&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector store:&lt;/strong&gt; pgvector (Supabase free tier)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4o-mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling:&lt;/strong&gt; Model Context Protocol (MCP)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; React + Vite + TypeScript + Tailwind CSS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full code is on my GitHub: &lt;strong&gt;&lt;a href="https://github.com/logout007" rel="noopener noreferrer"&gt;github.com/logout007&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Pinaki Batabyal — Full Stack Developer and Technical Lead. I write about things I build,&lt;br&gt;
break, and figure out. Connect with me on &lt;a href="https://linkedin.com/in/pinaki-batabyal" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
if you're into this kind of thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Currently exploring senior fullstack and AI engineering roles — remote or Kolkata/Bangalore.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>rag</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
