I Built a RAG Pipeline From Scratch and It Completely Changed How I Think About AI
I've been writing code for 3+ years. I thought I understood how AI worked.
I didn't.
Not until I sat down one weekend, opened a blank Node.js project, and decided to build something I'd been curious about for months — a system that could read a stack of PDFs and actually answer questions about them. In plain English. With sources.
What followed was honestly one of the most satisfying weeks of building I've ever had.
How It Started
I'd been using ChatGPT like everyone else — pasting text, asking questions, getting answers. But I kept hitting the same wall: it didn't know my documents. It couldn't read a specific PDF I had. It couldn't search across multiple files. It couldn't say "this answer is on page 12."
I knew RAG (Retrieval-Augmented Generation) was the solution. I'd read about it. I understood it conceptually.
Actually building it is a completely different thing.
The Moment It Clicked
The first time I uploaded a PDF, typed a question, and got back a precise answer — with the exact page number — I genuinely sat back and stared at the screen for a few seconds.
Not because it was magic. But because I understood every single step that produced that answer. I wrote every line. I knew why it worked.
That feeling is hard to describe. It's different from using a library or calling an API. This was mine, end to end.
Here's the architecture I landed on:
User types a question
↓
Embed the question (OpenAI text-embedding-3-small)
↓
Find the most similar chunks in the database (pgvector)
↓
Feed those chunks into GPT-4o-mini
↓
Get a precise, grounded answer
Four steps. Deceptively simple on paper. Deeply interesting to build.
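The four steps map almost one-to-one onto code. Here's a minimal sketch of the orchestration, with `embed` and `searchChunks` stubbed out (the real versions call the OpenAI embeddings API and run the pgvector query; the stub data and function names are my own):

```javascript
// Stub: a real implementation returns a 1536-dim vector from OpenAI.
async function embed(text) {
  return [text.length, 0.5, 0.25];
}

// Stub: a real implementation runs the pgvector similarity query.
async function searchChunks(queryEmbedding, limit = 5) {
  return [
    { content: 'Safety checks are listed on page 12.', page_number: 12, similarity: 0.81 },
  ];
}

async function answerQuestion(question, callLLM) {
  const queryEmbedding = await embed(question);       // step 1: embed the question
  const chunks = await searchChunks(queryEmbedding);  // step 2: retrieve similar chunks
  const context = chunks
    .map(c => `[page ${c.page_number}] ${c.content}`)
    .join('\n\n');
  return callLLM(context, question);                  // steps 3-4: grounded answer
}
```

The `[page N]` prefix on each chunk is what lets the model cite page numbers in its answer.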
What I Learned Building Each Layer
Chunking is harder than it sounds
My first attempt: split text every 500 characters. Done.
The results were awful. Sentences got cut in half. Context got destroyed. The model would retrieve a chunk that started mid-sentence and couldn't make sense of it.
The fix was breaking on sentence boundaries with overlap:
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = start + chunkSize;
    // Don't cut mid-sentence — find the nearest period
    if (end < text.length) {
      const breakPoint = text.lastIndexOf('.', end);
      if (breakPoint > start + chunkSize / 2) {
        end = breakPoint + 1;
      }
    }
    const chunk = text.slice(start, end).trim();
    if (chunk.length > 30) chunks.push(chunk);
    start = end - overlap; // overlap = no lost context at boundaries
  }
  return chunks;
}
The 50-character overlap sounds tiny. It makes a huge difference.
Embeddings feel like magic until you understand them
An embedding is just a list of 1536 numbers that represents the meaning of a piece of text. Two sentences that mean similar things will produce similar number lists — even if they use completely different words.
So "What are the safety requirements?" and "List the security rules" will have similar embeddings, even though they share no words. That's semantic search. That's what makes this better than ctrl+F.
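"Similar number lists" concretely means a high cosine similarity. With toy vectors in place of 1536-dimensional ones, the comparison looks like this:

```javascript
// Cosine similarity between two vectors: 1.0 means pointing the same
// direction (similar meaning), 0 means unrelated. pgvector's <=> operator
// computes the distance version of this (1 - similarity).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical vectors score 1, orthogonal ones score 0, and real embedding pairs land somewhere in between.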
I chose text-embedding-3-small over the older ada-002. 80% cheaper, equal or better quality. Easy choice.
Batch embedding 400 chunks and watching them all get stored in the database in about 8 seconds was one of those quiet "oh wow" moments.
pgvector is genuinely impressive
I expected to need a dedicated vector database — Pinecone, Weaviate, Qdrant. I'd heard of all of them.
Then I discovered pgvector — a Postgres extension that adds a vector column type and similarity search operators. I already know SQL. I already use Supabase. The whole setup is a few lines of SQL:
create extension if not exists vector;

create table document_chunks (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  page_number integer,
  embedding vector(1536)
);

create index on document_chunks
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
And querying it is just SQL:
select content, page_number,
       1 - (embedding <=> $1) as similarity
from document_chunks
order by embedding <=> $1
limit 5;
<=> is cosine distance. 1 - distance = similarity. I love how clean this is.
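One practical detail when running that query from Node: pgvector parses the bound parameter as a bracketed string literal, so the embedding array has to be serialised before it becomes `$1`. A sketch (assuming node-postgres; the helper name is my own):

```javascript
// pgvector accepts vectors as '[x,y,z]' string literals, so serialise
// the embedding array before binding it as a query parameter.
function toPgvectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// Hypothetical usage with node-postgres (`pg`):
// const { rows } = await pool.query(
//   `select content, page_number, 1 - (embedding <=> $1) as similarity
//      from document_chunks
//     order by embedding <=> $1
//     limit 5`,
//   [toPgvectorLiteral(queryEmbedding)]
// );
```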
The system prompt matters more than the model
I spent more time on the system prompt than on any other single piece of code. The difference between a well-prompted model and a poorly-prompted one is dramatic.
My first prompt: "Answer questions using the provided context."
Results: confidently wrong answers, hallucinated details, vague summaries.
My final prompt, after many iterations:
You are a precise document assistant. Answer questions using
ONLY the provided context chunks.
- If the answer is in the context, answer clearly.
- If it isn't, say exactly: "I could not find a clear answer
in the uploaded documents."
- Never make up information not present in the context.
- Be concise. Prefer bullet points for multi-part answers.
That single word "ONLY" and the explicit fallback phrase cut hallucinations significantly. The model still reasons and synthesises — it just stays grounded.
Temperature = 0.1, by the way. This isn't a creative writing task.
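Wiring the prompt in is mostly message assembly. A sketch of how I'd structure it (the system prompt text is the one above; the chunk-formatting convention and function name are my own):

```javascript
// The system prompt from above, kept as a constant.
const SYSTEM_PROMPT = `You are a precise document assistant. Answer questions using
ONLY the provided context chunks.
- If the answer is in the context, answer clearly.
- If it isn't, say exactly: "I could not find a clear answer
  in the uploaded documents."
- Never make up information not present in the context.
- Be concise. Prefer bullet points for multi-part answers.`;

// Build the chat messages: retrieved chunks go in the user turn,
// tagged with page numbers so the model can cite them.
function buildMessages(chunks, question) {
  const context = chunks
    .map(c => `[page ${c.page_number}] ${c.content}`)
    .join('\n\n');
  return [
    { role: 'system', content: SYSTEM_PROMPT },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}

// Hypothetical call with the official openai client:
// const res = await openai.chat.completions.create({
//   model: 'gpt-4o-mini',
//   temperature: 0.1, // near-deterministic: not a creative task
//   messages: buildMessages(chunks, question),
// });
```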
MCP was the rabbit hole I didn't expect
Halfway through the project I read about the Model Context Protocol (MCP) — an open standard for connecting LLMs to external tools and data sources: search a database, query an API, fetch live data. The underlying mechanic is tool calling, where the model emits structured function calls that your code executes.
I added two tools to my pipeline — here's the first, search_documents:
const tools = [
  {
    type: 'function',
    function: {
      name: 'search_documents',
      description: 'Search uploaded PDFs for relevant information',
      parameters: {
        type: 'object',
        properties: {
          query: { type: 'string' }
        },
        required: ['query']
      }
    }
  }
];
Now the model decides when to search. You can ask a multi-part question and it'll call the search tool, synthesise the results, and respond — all in one turn. No hardcoded routing logic. The model figures it out.
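"The model decides" concretely means checking each response for tool calls and dispatching them to local handlers. A minimal dispatcher sketch (the handler body is stubbed; in the real pipeline it embeds the query and hits pgvector):

```javascript
// Map tool names to local handler functions.
const toolHandlers = {
  search_documents: async ({ query }) => {
    // Stub: the real handler runs the embed + pgvector search.
    return `Results for: ${query}`;
  },
};

// The model emits { id, function: { name, arguments } } entries;
// run each matching handler and return its result as a tool message.
async function runToolCalls(toolCalls) {
  const messages = [];
  for (const call of toolCalls) {
    const handler = toolHandlers[call.function.name];
    const args = JSON.parse(call.function.arguments);
    const result = await handler(args);
    messages.push({ role: 'tool', tool_call_id: call.id, content: result });
  }
  return messages;
}
```

The tool messages go back into the conversation, and the model writes its final answer from them.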
This is when I started to understand why everyone in AI engineering talks about agents.
The Numbers (Real, Measured)
I ran 50 test queries across a set of PDFs I had lying around.
| Metric | Result |
|---|---|
| P95 response time | 2.8 seconds |
| Average response time | 1.9 seconds |
| Embedding cost for ~200 PDFs | ~$0.80 |
| Queries that returned correct page citation | 84% |
| Queries where I'd say the answer was "good" | ~82% |
The 18% miss rate is real and honest. It's mostly on questions that require synthesising information across many pages — a known weakness of basic RAG. Hybrid search (combining vector + keyword BM25) would improve this. That's my next experiment.
What I'd Tell Myself Before Starting
Start with fewer PDFs than you think. I tried to test with 50 documents at once. Debug with 3. You'll thank yourself.
The similarity threshold matters. I filter out chunks below 30% cosine similarity before passing them to the LLM. Without this filter, irrelevant chunks confuse the model and produce vague, wishy-washy answers.
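The filter itself is tiny — a sketch using the 30% threshold above (function name is my own):

```javascript
// Drop weak matches before they reach the LLM; below ~0.3 cosine
// similarity a chunk is more likely to confuse than to help.
const MIN_SIMILARITY = 0.3;

function filterRelevant(chunks, threshold = MIN_SIMILARITY) {
  return chunks.filter(c => c.similarity >= threshold);
}
```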
pdf-parse is good but imperfect. Scanned PDFs (images of text) return nothing — you need OCR for those. Text PDFs work great. Know your document types before you commit to a parsing strategy.
Per-page extraction from the start. I approximated page numbers. It works but isn't accurate. Use pdf.js if exact page attribution matters for your use case.
Why This Project Was Different
I build things constantly. APIs, dashboards, mobile apps. Most of it is satisfying in a normal way.
This one was different because every piece builds on the previous one in a way that feels like a proper system — not just features bolted together. The chunker feeds the embedder. The embedder feeds the vector store. The vector store feeds the retriever. The retriever feeds the synthesiser. Change one and it ripples through everything.
And the output is intelligent. It reads documents and understands them. I wrote the code that makes that happen, and I still find it a little bit remarkable every time I run it.
If you've been curious about RAG but haven't started — start. The gap between "I understand the concept" and "I built it" is where all the real learning happens.
The Stack
- Backend: Node.js + Express
- PDF parsing: pdf-parse
- Embeddings: OpenAI text-embedding-3-small
- Vector store: pgvector (Supabase free tier)
- LLM: GPT-4o-mini
- Tool calling: OpenAI-style function calling (MCP-inspired)
- Frontend: React + Vite + TypeScript + Tailwind CSS
Full code is on my GitHub: github.com/logout007
I'm Pinaki Batabyal — Full Stack Developer and Technical Lead. I write about things I build, break, and figure out. Connect with me on LinkedIn if you're into this kind of thing.
Currently exploring senior fullstack and AI engineering roles — remote or Kolkata/Bangalore.