Harshdeep Singh

Posted on Jun 9 • Originally published at theharshdeepsingh.com

Building an LLM Project From Scratch in 2026

#rag #agenticai #llm #mongodbatlas

Here’s the uncomfortable truth about “AI projects” a few years ago: the hard part was never the model. It was the plumbing. Standing up a vector database, wiring an embeddings pipeline, fighting with streaming responses, gluing five libraries together — by the time it worked, you’d forgotten what you set out to build.

In 2026 that plumbing has largely collapsed into a weekend’s worth of work. Model prices have fallen roughly 80% year over year, free tiers are genuinely usable, your database now does vector search natively, and one SDK handles streaming and tool calling across every provider. The skill that’s actually in demand — retrieval-augmented generation with agents — is now reachable by a developer who has never touched machine learning.

So this guide does something specific. We’re going to build one real project, end to end, that you can put on your portfolio and let strangers use: an app where someone uploads their documents and chats with them — asking questions and getting answers grounded in their own files, with citations, streamed token by token. It’s the canonical 2026 LLM project, and it teaches almost everything else by osmosis.

In plain English. “RAG” means the AI doesn’t answer from memory — it looks things up first. You give it a pile of documents; when you ask a question, it finds the most relevant passages and answers using only those. That’s why it can talk about your files without ever having been trained on them.

This guide is written for three readers at once: newcomers, working software engineers, and AI engineers. The main text stays approachable; the “In plain English” notes add no-jargon explanations, and the “Under the hood” notes add depth for engineers.

The roadmap — what we’ll actually do

Eight steps. Each one produces something that works before we add the next layer.

1. Build the mental model: how RAG (and agentic RAG) really works, in one diagram.
2. Choose your models: a cost comparison of hosted and self-hosted LLMs, and which embedding model to use.
3. Set up the MERN stack: project skeleton plus a MongoDB Atlas vector index.
4. Ingest documents: upload, parse a PDF, and split it into chunks.
5. Embed & store: turn chunks into vectors and save them in MongoDB.
6. Retrieve: find the right passages with a single $vectorSearch query.
7. Make it agentic: let the model call retrieval as a tool, on its own terms.
8. Stream & deploy: render tokens live in React, then ship it to its own URL for free.

Let’s start with the one idea that makes the other seven make sense.

Step 1 · The mental model: how RAG actually works

An LLM is a brilliant improviser with no access to your private data and a tendency to confidently make things up. Retrieval-Augmented Generation (RAG) fixes both problems with one move: before the model answers, you fetch relevant facts and hand them over as context. The model then answers from evidence rather than from vibes.

There are two phases. The first happens once, ahead of time (ingestion); the second happens on every question (retrieval + generation).

INGESTION (run once, when a document is added)
  document --> split into chunks --> embed each chunk --> store vectors in MongoDB

RETRIEVAL + GENERATION (run on every question)
  question --> embed --> vector search in MongoDB --> top-k chunks
                                                         |
                             +---------------------------+
                             v
        [ question + retrieved chunks ] --> LLM --> grounded answer + citations

The magic ingredient is the embedding: a list of numbers (a vector) that captures the meaning of a piece of text. Two passages about “canceling a subscription” land near each other in this number-space even if one says “refund” and the other says “cancel my plan.” Searching by meaning instead of keywords is what makes RAG feel intelligent.
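To make "near each other in number-space" concrete, here is a minimal sketch of cosine similarity, the distance measure this guide later selects for the vector index. The three-dimensional vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity between two equal-length vectors:
// close to 1 = pointing the same way, close to 0 = unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy "embeddings" (assumption: not produced by any real model).
const refundPolicy = [0.9, 0.1, 0.0];
const cancelMyPlan = [0.8, 0.2, 0.1];
const pizzaRecipe = [0.0, 0.1, 0.9];

console.log(
  cosineSimilarity(refundPolicy, cancelMyPlan) >
    cosineSimilarity(refundPolicy, pizzaRecipe)
); // true
```

The two subscription-related vectors score far higher against each other than either does against the unrelated one, which is the whole trick behind searching by meaning.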

In plain English. Imagine every sentence gets pinned onto a giant map, where similar meanings sit close together. To answer your question, the app drops a pin for your question and grabs whatever text is pinned nearby. Those nearby notes become the AI’s cheat sheet.

What makes it “agentic” — the 2026 upgrade

Classic RAG retrieves once and hopes the first search was good enough. That breaks on real questions: “Compare the refund policy in the 2024 contract with the 2025 one” needs two different searches and a comparison. Agentic RAG hands the model the steering wheel. Retrieval becomes a tool the model can call — repeatedly — deciding what to search for, judging whether the results are sufficient, and searching again before it commits to an answer.

Under the hood. A 2025 survey (“Agentic Retrieval-Augmented Generation,” arXiv:2501.09136) frames these systems around four patterns: reflection, planning, tool use, and multi-agent collaboration. In practice you’ll implement query rewriting/decomposition, multi-hop “retrieve → reason → retrieve” loops, and self-critique (“do these passages actually answer the question?”). The cost: 3–10× the tokens and 2–5× the latency of vanilla RAG. So gate it — a trivial FAQ should never enter the loop; a cross-document question can’t be answered without it. Cap the loop at ~5 iterations so a confused agent can’t spend your budget in a runaway.
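One way to "gate it" is a cheap check that runs before the model is involved at all. The sketch below is a hypothetical heuristic, not something from the survey or any library: the cue list, the `chooseRoute` name, and the returned shape are all assumptions you would tune against your own questions.

```javascript
// Hypothetical cue words that often signal a multi-part question.
const MULTI_HOP_CUES = /\b(compare|versus|vs|difference between|both|across)\b/i;

// Decide whether a question should enter the agentic loop or take one fast search.
function chooseRoute(question) {
  // Two or more distinct years usually means two separate lookups.
  const years = new Set(question.match(/\b(19|20)\d{2}\b/g) ?? []);
  const looksMultiHop = MULTI_HOP_CUES.test(question) || years.size > 1;
  return looksMultiHop
    ? { route: "agentic", maxSteps: 5 } // capped, as recommended above
    : { route: "single-shot", maxSteps: 1 };
}

console.log(chooseRoute("What is the refund window?").route);
console.log(
  chooseRoute("Compare the refund policy in the 2024 contract with the 2025 one").route
);
```

A keyword gate like this will misroute some questions. It is only meant as a first, nearly free filter; a small classifier model is the usual next step if it proves too blunt.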

We’ll build the simple pipeline first (so you can see every piece), then promote retrieval to an agentic tool in Step 7. That progression is the lesson.

Step 2 · Choosing your LLMs — cheapest viable first

You’ll use two kinds of model: an embedding model (turns text into vectors) and a generation model (writes the answer). They’re priced and chosen separately. Let’s start with generation, since that’s where the “which LLM?” anxiety lives.

Read this first. Every price below is a 2026 snapshot and model names change almost monthly. Treat this table as a shape, not gospel — confirm the current number on the provider’s pricing page before you commit. The strategy (route cheap, escalate rarely) outlives any specific figure.

Provider / model	Input ($/1M)	Output ($/1M)	Free tier?	Best for
Google Gemini Flash-Lite	~$0.10–0.25	~$0.40–1.50	Yes — generous	Learning & high volume; the default starter
Groq · Llama 3.1 8B	~$0.05	~$0.08	Yes	Blazing-fast responses; cheapest tokens
DeepSeek	~$0.14	~$0.28	No (cheap)	Cheapest frontier-class; OpenAI-compatible API
OpenAI · GPT mini-tier	~$0.15–0.75	~$0.60–4.50	Credits	Strong all-rounder; great tool calling
Anthropic · Claude Haiku	~$1.00	~$5.00	No	Cheapest Claude; reliable instruction-following
Frontier (GPT / Claude / Gemini Pro)	~$2.50–5.00	~$15–30	No	Final answer only, when quality truly matters

The pattern jumps out: the cheapest models are 50–100× cheaper than the flagships. For a RAG app, most of your token spend is feeding retrieved context into the model — so a cheap, capable model handling that bulk is the entire cost game. Use a frontier model only for the final synthesis, and only if you can measure that it’s actually better for your task.
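The arithmetic behind that claim is easy to check per request. This is a rough sketch: the prices are the approximate figures from the table above, and the token counts for a "typical" RAG turn are an assumption, so substitute your own numbers.

```javascript
// Approximate $/1M-token prices copied from the table above (snapshot values).
const PRICES = {
  cheap: { input: 0.05, output: 0.08 },     // Groq-hosted Llama 3.1 8B row
  frontier: { input: 2.5, output: 15.0 },   // low end of the frontier row
};

// Cost of one request in US dollars.
function requestCostUsd({ inputTokens, outputTokens }, price) {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Assumed turn size: ~3,000 tokens of question plus retrieved context in, ~300 out.
const turn = { inputTokens: 3000, outputTokens: 300 };
const cheapCost = requestCostUsd(turn, PRICES.cheap);
const frontierCost = requestCostUsd(turn, PRICES.frontier);

console.log(`cheap: $${cheapCost.toFixed(6)}`);
console.log(`frontier: $${frontierCost.toFixed(6)}`);
console.log(`ratio: ${(frontierCost / cheapCost).toFixed(0)}x`);
```

Because input tokens dominate a RAG request, the input price matters more than the output price when you compare models for this workload.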

Self-hosted: running models on your own machine

You can skip API bills entirely with Ollama, which runs open models locally and exposes an OpenAI-compatible endpoint at http://localhost:11434/v1 — meaning your code barely changes. One command pulls and runs a model:

# install from ollama.com, then:
ollama run llama3.1:8b          # chat model, ~6-8 GB VRAM
ollama pull nomic-embed-text    # local embedding model, free

Under the hood. Rough VRAM at Q4 quantization: 7–8B models ≈ 6–8 GB, 14B ≈ 10–12 GB, 32B ≈ 20–22 GB, 70B ≈ 43–48 GB (Apple Silicon unified memory counts fully). Break-even vs hosted APIs is roughly 500K tokens/day of sustained traffic — below that, hosted is cheaper and you skip the ops. Trade-offs: full privacy and $0/token, but weaker reasoning than frontier models and you own the uptime.
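"Your code barely changes" can be seen in a plain HTTP call. The sketch below builds an OpenAI-style chat request against the local Ollama endpoint mentioned above; `buildChatRequest` and `askLocalModel` are illustrative names, and the call only succeeds while Ollama is running with the model already pulled.

```javascript
const OLLAMA_BASE_URL = "http://localhost:11434/v1"; // endpoint quoted in the text

// Build an OpenAI-compatible chat completion request for a local model.
function buildChatRequest(question, model = "llama3.1:8b") {
  return {
    url: `${OLLAMA_BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: question }],
      }),
    },
  };
}

// Send the request and return the reply text (requires a running Ollama).
async function askLocalModel(question) {
  const { url, init } = buildChatRequest(question);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Pointing the same request at a hosted OpenAI-compatible provider is mostly a matter of changing the base URL and adding an Authorization header.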

The embedding model — quieter, but it matters

Embeddings are dramatically cheaper than generation, so this is an easy call. For most projects, OpenAI’s text-embedding-3-small is the sweet spot.

Model	Dimensions	Price ($/1M)	Notes
OpenAI text-embedding-3-small	1536	~$0.02	Best balance; our pick for the build
Google gemini-embedding	768	Free tier / ~$0.025	Free-tier friendly
Voyage (voyage-3.5-lite)	512–1024 (reducible)	~$0.02	Now MongoDB-owned; long context
nomic-embed-text / BGE-M3 (open)	768 / 1024	Free (self-host)	Run free in Ollama; great quality

Gotcha that bites everyone. Vectors from different embedding models are not compatible. If you switch embedding models later, you must re-embed your entire corpus. Pick one and commit. (Embedding 10M chunks with text-embedding-3-small costs only ~$100, so this is about consistency, not cost.)
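The ~$100 figure follows from simple arithmetic, sketched here. The 500 tokens per chunk is an assumption matching the chunk size used later in this guide; the price is the table's approximate figure.

```javascript
// One-off cost of embedding a corpus, in US dollars.
function embeddingCostUsd(chunkCount, tokensPerChunk, pricePerMillionTokens) {
  const totalTokens = chunkCount * tokensPerChunk;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 10M chunks x ~500 tokens each at ~$0.02 per 1M tokens.
const reembedCost = embeddingCostUsd(10_000_000, 500, 0.02);
console.log(`$${reembedCost.toFixed(2)}`); // $100.00
```

A portfolio project with a few thousand chunks lands at a fraction of a cent, which is why re-embedding is a consistency chore rather than a budget problem.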

Our choices for this build: embeddings via text-embedding-3-small; generation via a cheap, fast model (Gemini Flash or a GPT mini-tier model) while learning — swappable in one line thanks to the SDK we’re about to set up.

Step 3 · Setting up the MERN stack

MERN is a natural fit for RAG in 2026 for one reason that didn’t used to be true: MongoDB does vector search natively. Your embeddings live in the same documents as your data, queried with a normal aggregation pipeline. No separate vector database to run, sync, or pay for.

Here’s the shape of the app — a standard MERN split, with the LLM logic living safely on the server:

MongoDB Atlas — stores documents, chunks, embeddings, and chat history. Free M0 tier includes vector search.
Express + Node.js — the API: handles uploads, embedding, retrieval, and talking to the LLM. All API keys live here, never in the browser.
React — the chat UI, rendering streamed tokens as they arrive.
The glue: the Vercel AI SDK — one library for streaming, provider-switching, and tool calling, on both server and client.

The one piece of setup that’s new: the vector index

After creating a free cluster on Atlas, you define a vector search index on the collection that will hold your chunks. This tells MongoDB how to search the embedding field. Create it in the Atlas UI (Atlas Search → Create Index → JSON editor) or via code, and name it vector_index, the name the queries in Steps 6 and 7 refer to. The definition:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "userId" }
  ]
}

Under the hood. numDimensions must exactly match your embedding model’s output (1536 for text-embedding-3-small). similarity can be cosine, euclidean, or dotProduct — cosine is the safe default. The filter field on userId is what lets you scope searches per-user, so visitors only ever retrieve their own documents — essential the moment your demo is public. Atlas uses HNSW for approximate nearest-neighbor search under the hood.
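Since a dimension mismatch is the easiest mistake to make here, a small guard before inserting or querying can help. This is a sketch under assumptions: `assertMatchesIndex` is a hypothetical helper, and it checks a local copy of the index definition rather than asking Atlas.

```javascript
// Local copy of the index definition shown above.
const indexDefinition = {
  fields: [
    { type: "vector", path: "embedding", numDimensions: 1536, similarity: "cosine" },
    { type: "filter", path: "userId" },
  ],
};

// Hypothetical guard: throws if a vector's length does not match the index.
function assertMatchesIndex(embedding, definition, path = "embedding") {
  const field = definition.fields.find(
    (f) => f.type === "vector" && f.path === path
  );
  if (!field) throw new Error(`no vector field declared for path "${path}"`);
  if (embedding.length !== field.numDimensions) {
    throw new Error(
      `expected ${field.numDimensions} dimensions, got ${embedding.length}`
    );
  }
  return true;
}

// A 768-dim vector (e.g. from a different embedding model) is rejected.
try {
  assertMatchesIndex(new Array(768).fill(0), indexDefinition);
} catch (err) {
  console.log(err.message);
}
```

Running this check once at startup against a test embedding is usually enough to catch a swapped model before any data is written.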

That’s the only “AI-specific infrastructure” in the whole project. Everything else is ordinary Express and React.

Step 4 · Ingesting documents: upload, parse, chunk

When a user uploads a file, three things happen on the server: we accept the upload, extract its text, and split that text into bite-sized chunks.

Accepting uploads is standard Express (use multer). Extracting text from a PDF takes two lines with pdf-parse v2, whose class-based API differs from the one in pdf-parse@1 (see the note in the code):

import { PDFParse } from "pdf-parse"; // requires the v2 fork, not the default pdf-parse@1

// `buffer` is the uploaded file from multer
const parser = new PDFParse({ data: buffer });
const { text } = await parser.getText();

Why we chunk — and how big

You can’t embed an entire 50-page PDF as one vector; the meaning gets blurred into mush, and you’d feed the model far more than it needs. So we slice the text into passages. Each chunk becomes one searchable unit.

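// ~1 token = ~4 characters, so 2000 chars = ~500 tokens.
// We overlap chunks so a sentence split across a boundary
// still appears whole in at least one chunk.
function chunkText(text, size = 2000, overlap = 200) {
  // Guard: with overlap >= size the loop below would never advance.
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const chunks = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size).trim());
    // Stop once a chunk reaches the end of the text, so we don't emit
    // a tail that only repeats the previous chunk's overlap region.
    if (i + size >= text.length) break;
  }
  return chunks.filter(Boolean);
}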

In plain English. Think of chunking like cutting a long article into index cards. Too big and each card covers too many topics to be useful; too small and you lose context. A few hundred words per card, with a little overlap so sentences don’t get sliced in half, is the reliable starting point.

Under the hood. Start with recursive character splitting at ~400–512 tokens with 10–20% overlap — the pragmatic default (~85–90% retrieval recall). Semantic chunking can add ~2–3% recall but costs roughly 14× more to index, so only graduate to it when your evaluation metrics demand it. One caveat: at least one 2026 analysis found overlap added no measurable benefit in its setup while raising indexing cost — so treat the overlap figure as a starting point to validate against your own data, not a law.
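For readers curious what "recursive character splitting" means in practice, here is a minimal sketch. It tries coarse separators first (paragraphs), falls back to finer ones (lines, then spaces), and only hard-cuts as a last resort. It is a simplified illustration with no overlap, not the implementation from any particular library, and `recursiveSplit` is a name chosen for this example.

```javascript
// Separators from coarsest to finest.
const DEFAULT_SEPARATORS = ["\n\n", "\n", " "];

function recursiveSplit(text, maxChars = 2000, separators = DEFAULT_SEPARATORS) {
  const trimmed = text.trim();
  if (trimmed.length <= maxChars) return trimmed ? [trimmed] : [];

  if (separators.length === 0) {
    // No separators left: fall back to fixed-width cuts.
    const cuts = [];
    for (let i = 0; i < trimmed.length; i += maxChars) {
      cuts.push(trimmed.slice(i, i + maxChars).trim());
    }
    return cuts.filter(Boolean);
  }

  const [separator, ...finer] = separators;
  const chunks = [];
  let current = "";
  for (const piece of trimmed.split(separator)) {
    // Greedily pack pieces together while they still fit.
    const candidate = current ? current + separator + piece : piece;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    if (piece.length > maxChars) {
      // A single piece is too big: split it with the finer separators.
      chunks.push(...recursiveSplit(piece, maxChars, finer));
      current = "";
    } else {
      current = piece;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

console.log(recursiveSplit("aaaa bbbb\n\ncccc dddd", 9));
```

The benefit over fixed-width slicing is that chunk boundaries tend to fall on paragraph or sentence breaks, so fewer passages are cut mid-thought.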

Step 5 · Embedding & storing vectors in MongoDB

Now we turn each chunk into a vector and save it. The AI SDK’s embedMany batches the whole array efficiently, then we write one MongoDB document per chunk — text and vector together, tagged with the owner and source.

import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

const chunks = chunkText(text);                 // from Step 4

const { embeddings } = await embedMany({
  model: openai.embedding("text-embedding-3-small"),
  values: chunks,                               // array of strings
});

await db.collection("chunks").insertMany(
  chunks.map((chunk, i) => ({
    userId,                                     // who owns it
    source: filename,                           // where it came from
    text: chunk,                                // the passage itself
    embedding: embeddings[i],                   // the 1536-dim vector
    chunkIndex: i,
    createdAt: new Date(),
  }))
);

That’s ingestion done. The vector is just an array of floats stored on a normal document — no special database, no migration. Upload a 30-page PDF and you’ve got a few dozen searchable, meaning-aware chunks sitting in MongoDB.

Under the hood. embedMany auto-batches large arrays, so you can hand it hundreds of chunks without managing request limits yourself. Store rich metadata (page number, section heading, document ID) alongside each chunk now — you’ll want it later for citations, filtering, and “parent-document” retrieval. This is the step you run once per upload, not once per question.
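A sketch of what "store rich metadata" might look like when building the documents. The `documentId` and `chunkCount` fields are hypothetical additions for illustration; together with `chunkIndex` they would let you fetch a chunk's neighbours later. The length check guards against chunks and vectors drifting out of step.

```javascript
// Hypothetical helper: pair each chunk with its vector and attach metadata.
function toChunkDocuments({ chunks, embeddings, userId, source, documentId }) {
  if (chunks.length !== embeddings.length) {
    throw new Error(
      `got ${chunks.length} chunks but ${embeddings.length} embeddings`
    );
  }
  const createdAt = new Date();
  return chunks.map((text, chunkIndex) => ({
    userId,                          // owner, used by the index filter
    documentId,                      // assumption: one id per uploaded file
    source,                          // original filename, for citations
    text,
    embedding: embeddings[chunkIndex],
    chunkIndex,
    chunkCount: chunks.length,       // lets you detect first/last chunk
    createdAt,
  }));
}

// Toy 2-dimensional vectors purely for illustration.
const docs = toChunkDocuments({
  chunks: ["first passage", "second passage"],
  embeddings: [[0.1, 0.2], [0.3, 0.4]],
  userId: "user-1",
  source: "contract.pdf",
  documentId: "doc-1",
});
console.log(docs.length, docs[1].chunkIndex);
```

The resulting array can be passed straight to insertMany in place of the inline map shown above.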

Step 6 · Retrieval: finding the right passages

Here’s the payoff for all that setup. To answer a question, we embed the question with the same model, then run a single $vectorSearch aggregation to pull the closest chunks. This is the whole of “search by meaning,” in one query:

import { embed } from "ai";
import { openai } from "@ai-sdk/openai";

const { embedding } = await embed({
  model: openai.embedding("text-embedding-3-small"),
  value: userQuestion,
});

const passages = await db.collection("chunks").aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: embedding,
      numCandidates: 150,        // over-fetch, then narrow
      limit: 5,                  // keep the best 5
      filter: { userId: { $eq: currentUserId } }
    }
  },
  {
    $project: {
      _id: 0,
      text: 1,
      source: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
]).toArray();

You now have the five most relevant passages, each with a similarity score. Feed those into the model as context and you have working RAG. But before we generate, two notes that separate a toy from something good:

Under the hood. $vectorSearch must be the first stage in the pipeline. numCandidates is the approximate-search breadth: it must be ≥ limit, and a common heuristic is 10–20× your limit. The 150 used here is 30× a limit of 5, more generous than that heuristic and affordable on a small corpus. The filter on userId uses the field we declared in the index, enforcing per-user isolation efficiently. Use the score as a relevance gate: if the top result scores below ~0.75, it’s often better to answer “I don’t have enough information” than to let the model hallucinate from weak matches.
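The relevance gate can be a few lines. One detail worth knowing when picking a threshold: with cosine similarity, Atlas reports a score normalized to the 0–1 range as (1 + cosine) / 2, so 0.75 corresponds to a raw cosine of 0.5. The sketch below assumes passages shaped like the query results above; `gateByScore` is an illustrative name and the 0.75 default should be tuned on your own data.

```javascript
// Atlas normalizes cosine similarity into 0..1 for vectorSearchScore.
function toAtlasCosineScore(cosine) {
  return (1 + cosine) / 2;
}

// Keep only passages at or above the threshold; report whether any survived.
function gateByScore(passages, minScore = 0.75) {
  const kept = passages.filter((p) => p.score >= minScore);
  return kept.length > 0
    ? { answerable: true, passages: kept }
    : { answerable: false, passages: [] };
}

const gated = gateByScore([
  { text: "Refunds are issued within 30 days.", source: "policy.pdf", score: 0.91 },
  { text: "Office hours are 9 to 5.", source: "handbook.pdf", score: 0.62 },
]);
console.log(gated.answerable, gated.passages.length); // true 1
```

This goes slightly further than the gate described above: besides refusing when nothing clears the bar, it also drops the weak matches that would otherwise dilute the context.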

The single biggest quality upgrade you’ll make later isn’t a bigger model — it’s hybrid search + reranking (we cover it in “Where to go next”). For now, vector search alone is plenty to ship.
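To finish the classic, non-agentic pipeline, the retrieved passages still have to be handed to the model. A minimal sketch of that hand-off follows; the prompt wording and the `buildGroundedPrompt` name are assumptions, and the numbered-passage format is just one convention that makes citations easy to ask for.

```javascript
// Turn retrieved passages into a system + prompt pair with numbered sources.
function buildGroundedPrompt(question, passages) {
  const context = passages
    .map((p, i) => `[${i + 1}] (${p.source})\n${p.text}`)
    .join("\n\n");
  return {
    system:
      "Answer using only the numbered passages. Cite them like [1]. " +
      "If they do not contain the answer, say you don't know.",
    prompt: `Passages:\n${context}\n\nQuestion: ${question}`,
  };
}

const grounded = buildGroundedPrompt("What is the refund window?", [
  { source: "policy.pdf", text: "Refunds are issued within 30 days." },
  { source: "faq.pdf", text: "Contact support to start a refund." },
]);
console.log(grounded.prompt);
```

The returned object has the same `system` and `prompt` keys that the AI SDK's text-generation calls accept, so it can be spread into such a call alongside a model.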

Step 7 · Making retrieval agentic

So far the server retrieves before calling the model — a fixed pipeline. To make it agentic, we flip the control: we describe retrieval as a tool, hand it to the model, and let the model decide when (and how often) to call it. This is the “hot” part of 2026 — and with the AI SDK it’s remarkably little code.

import { streamText, tool, embed, stepCountIs, convertToModelMessages } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// 1) Retrieval, described as a tool the model can call
const searchDocuments = tool({
  description: "Search the user's uploaded documents for passages " +
               "relevant to a question. Call this whenever you need facts.",
  inputSchema: z.object({
    query: z.string().describe("a focused search query"),
  }),
  execute: async ({ query }) => {
    const { embedding } = await embed({
      model: openai.embedding("text-embedding-3-small"),
      value: query,
    });
    return db.collection("chunks").aggregate([
      { $vectorSearch: {
          index: "vector_index", path: "embedding",
          queryVector: embedding, numCandidates: 150, limit: 5,
          filter: { userId: { $eq: currentUserId } } } },
      { $project: { _id: 0, text: 1, source: 1,
          score: { $meta: "vectorSearchScore" } } },
    ]).toArray();
  },
});

// 2) Let the model run the loop: think -> search -> (search again) -> answer
const result = streamText({
  model: openai("gpt-4o-mini"),     // swap to any provider in one line
  system: "Answer ONLY using passages returned by searchDocuments. " +
          "Cite the source. If the passages don't contain the answer, " +
          "say you don't know - do not guess.",
  // `messages` is the chat history from req.body; useChat sends UI messages,
  // which must be converted to model messages before calling the model.
  messages: await convertToModelMessages(messages),
  tools: { searchDocuments },
  stopWhen: stepCountIs(5),         // hard cap on the agentic loop
});

// Returns a web-standard Response. In an Express handler, call
// result.pipeUIMessageStreamToResponse(res) instead.
return result.toUIMessageStreamResponse();

Read what that does, because it’s genuinely different from classic RAG: the model receives the question, decides on its own to call searchDocuments with a query it wrote, reads the results, and may call it again with a refined query before answering. For “compare the 2024 and 2025 refund policies,” it can naturally run two searches and synthesize. You didn’t orchestrate that — the model did.

Important 2026 change. If you learned the AI SDK before v5: maxSteps was removed from the client. Multi-step tool loops are now controlled server-side with stopWhen (e.g. stepCountIs(5)). This cap is also your cost safety rail — without it, a confused agent could loop and run up your bill.

Under the hood. The system prompt is doing heavy lifting for safety and grounding: “answer only from retrieved passages” plus “say you don’t know” is your first and cheapest defense against hallucination. The Zod inputSchema gives the model a typed contract for the tool’s arguments. toUIMessageStreamResponse() emits a standard SSE stream the React client consumes natively. Want it reusable across other AI clients (Claude Desktop, Cursor, etc.)? Expose this same retrieval as an MCP server — overkill for a single app, but the natural next step if your tools should be shared.
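If the loop feels opaque, it can help to see its shape with the SDK stripped away. The sketch below is a conceptual model only, not how the AI SDK is implemented: `decide` stands in for the model and `search` for the retrieval tool, both hypothetical functions you pass in.

```javascript
// Conceptual capped tool loop. `decide(question, evidence)` must return either
// { type: "search", query } or { type: "answer", text }.
function runCappedLoop(question, decide, search, maxSteps = 5) {
  const evidence = [];
  for (let step = 1; step <= maxSteps; step++) {
    const action = decide(question, evidence);
    if (action.type === "answer") {
      return { answer: action.text, steps: step, evidence };
    }
    evidence.push(...search(action.query)); // gather more passages, then loop
  }
  // Cap reached: stop spending tokens and return without an answer.
  return { answer: null, steps: maxSteps, evidence };
}

// Stand-in "model": searches once per contract year, then answers.
const fakeDecide = (question, evidence) => {
  if (evidence.length === 0) return { type: "search", query: "refund policy 2024" };
  if (evidence.length === 1) return { type: "search", query: "refund policy 2025" };
  return { type: "answer", text: `Compared ${evidence.length} passages.` };
};
const fakeSearch = (query) => [`passage for: ${query}`];

const outcome = runCappedLoop("Compare 2024 and 2025 refunds", fakeDecide, fakeSearch);
console.log(outcome.steps, outcome.answer); // 3 "Compared 2 passages."
```

The important property is the final return: a model that never decides to answer is stopped by the cap instead of by your invoice.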

Step 8 · Streaming to React, then deploying for free

The backend streams tokens; the frontend renders them as they land. The AI SDK’s useChat hook handles the entire streaming lifecycle, so your component stays tiny:

"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status } = useChat({
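    // Relative path works when the API shares the frontend's origin;
    // use the backend's full URL (and enable CORS there) if it is hosted elsewhere.
    transport: new DefaultChatTransport({ api: "/api/chat" }),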
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}: </strong>
          {m.parts.map((p, i) =>
            p.type === "text" ? <span key={i}>{p.text}</span> : null
          )}
        </div>
      ))}

      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button
        disabled={status !== "ready"}
        onClick={() => { sendMessage({ text: input }); setInput(""); }}
      >
        {status === "streaming" ? "Thinking..." : "Ask"}
      </button>
    </div>
  );
}

The status field (ready / submitted / streaming / error) gives you loading and disabled states for free. Tokens appear live as the model writes them — the experience people now expect from any AI app.
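If you want finer-grained button states than the component above, the four status values map cleanly onto a small lookup. This is an optional sketch: `askButtonState` and the labels are illustrative choices, and unlike the component above it re-enables the button after an error so the user can retry.

```javascript
// Map a useChat status value to button props.
function askButtonState(status) {
  switch (status) {
    case "ready":
      return { disabled: false, label: "Ask" };
    case "submitted":
      return { disabled: true, label: "Sending..." };  // request sent, no tokens yet
    case "streaming":
      return { disabled: true, label: "Thinking..." }; // tokens arriving
    case "error":
      return { disabled: false, label: "Retry" };
    default:
      return { disabled: true, label: "Ask" };         // unknown status: stay safe
  }
}

console.log(askButtonState("streaming").label); // Thinking...
```

Keeping this mapping outside the component makes it trivial to unit test without rendering React.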

Putting it on its own website — for free

This is the part that turns a tutorial into a portfolio piece. Three boxes, three free tiers:

Vercel — React frontend, free Hobby tier, global CDN, custom domain, auto-deploy from GitHub.
Render — Node/Express backend, free web service (sleeps when idle), where streaming and keys live.
MongoDB Atlas M0 — database + vector search, permanently free, 512 MB, limited Vector Search index capacity (verify current limits in Atlas docs).

Total cost for a low-traffic demo: $0–$5/month.

Under the hood. Run the streaming endpoint on the long-lived backend (Render/Railway), not a short serverless function that can time out mid-stream. Know the free-tier edges: Render free services spin down after ~15 min idle (a ~30–60s cold start on the next request, fine for a portfolio), and Atlas M0 is capped at 512 MB of storage with limited Vector Search index capacity. When you outgrow them, a dedicated Atlas tier and an always-on backend plan are the upgrade path.

Cost controls so you never get a surprise bill

Set a hard spend cap in your LLM provider dashboard. Non-negotiable for a public demo.
Default to a cheap model; cap output length (maxOutputTokens in the AI SDK, max_output_tokens in some provider APIs); keep the agentic stepCountIs low.
Add per-user / per-IP rate limiting and an auth wall so bots can’t drain your quota.
Log token usage per request (the SDK’s onFinish callback) so you can see costs before they surprise you.
Keep every API key on the server. A key shipped to the browser is a key that will be stolen.
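The rate-limiting item above can start as something very small. Below is a sketch of an in-memory fixed-window limiter; `createRateLimiter` is a hypothetical helper, the limits are placeholders, and because the counters live in process memory they reset on restart and are not shared across instances.

```javascript
// Fixed-window limiter: at most `limit` calls per key per `windowMs`.
// `now` is injectable so the behaviour can be tested without waiting.
function createRateLimiter({ limit = 20, windowMs = 60_000, now = Date.now } = {}) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key) {
    const t = now();
    const current = windows.get(key);
    if (!current || t - current.start >= windowMs) {
      windows.set(key, { start: t, count: 1 }); // open a fresh window
      return true;
    }
    if (current.count >= limit) return false;   // over quota for this window
    current.count += 1;
    return true;
  };
}

// Possible Express usage (assumes an existing `app`):
// const allow = createRateLimiter({ limit: 20, windowMs: 60_000 });
// app.use("/api/chat", (req, res, next) =>
//   allow(req.ip) ? next() : res.status(429).json({ error: "Too many requests" })
// );
```

For anything beyond a single-instance demo, a maintained middleware or a shared store is the safer choice; the point here is only that the first line of defence is cheap to add.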

Common pitfalls (and the fixes)

The semantic gap. Your question and the document use different words and vector search misses the match. Fix: add hybrid search (keyword + vector).
Context dilution. You retrieve 10 chunks when only 2 are relevant, and the noise degrades the answer. Fix: rerank, then keep a tighter top-k.
Chunk-boundary amnesia. The answer is split across two chunks and neither is retrieved whole. Fix: overlap, or parent-document retrieval.
Confident nonsense. The model answers from weak matches as if certain. Fix: a similarity-score threshold plus a system prompt that permits “I don’t know.”
Reaching for the agent loop too early. Simple lookups don’t need multi-hop reasoning — they need one fast search. Fix: gate the agentic path to genuinely complex questions.

Where to go next

You’ve shipped a working agentic RAG app. Three upgrades, in priority order:

Hybrid search + reranking — the highest-ROI quality jump. Run keyword (Atlas full-text via $search) and vector search, fuse them with Reciprocal Rank Fusion (RRF), then rerank the top 20–50 candidates with a cross-encoder (Cohere, Voyage, or self-hosted BGE) and keep the best handful. Benchmarks routinely show reranking as the single biggest accuracy gain.
Better ingestion — messy PDFs with tables and multi-column layouts need a real parser (LlamaIndex’s LiteParse, Unstructured, or Docling) rather than plain text extraction.
Make it shareable via MCP — expose your retrieval as a Model Context Protocol server so other AI clients can use the same tool. Worth it once your tools outlive this one app.
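Of the three upgrades, Reciprocal Rank Fusion is the easiest to picture in code: each result list contributes 1 / (k + rank) to a document's score, and documents that rank well in both lists float to the top. A minimal sketch, assuming each search returns an ordered array of chunk ids (the ids below are made up) and using the conventional constant k = 60:

```javascript
// Reciprocal Rank Fusion over any number of ranked id lists.
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const keywordHits = ["c7", "c2", "c9"]; // e.g. from a full-text query
const vectorHits = ["c2", "c4", "c7"];  // e.g. from $vectorSearch

const fused = reciprocalRankFusion([keywordHits, vectorHits]);
console.log(fused.map((r) => r.id).join(", ")); // c2, c7, c4, c9
```

RRF only needs ranks, not comparable scores, which is why it works for fusing keyword and vector results whose raw scores live on different scales. The fused list is what you would then hand to a reranker.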

That’s the whole arc: from “an LLM can’t see my data” to a public, agentic, document-grounded chat app that costs about nothing to run. The plumbing finally got out of the way — what you build on top of it is the interesting part. Now go put something on that empty portfolio URL.

Frequently asked questions

What is agentic RAG, exactly?

Agentic RAG turns retrieval into a tool the model calls on demand. Instead of one fixed “retrieve then answer” pass, the model plans, searches, judges whether the results are sufficient, and searches again until it has enough evidence — then answers. It’s slower and costs more tokens, but it handles complex, multi-step questions that one-shot RAG can’t.

Do I need a separate vector database?

No. On the MERN stack you store embeddings inside your normal MongoDB documents and query them with the $vectorSearch aggregation stage in MongoDB Atlas. For the vast majority of projects, that removes the need for a dedicated vector database entirely.

What’s the cheapest LLM for a RAG app in 2026?

For learning, the most generous free tiers are Google Gemini Flash/Flash-Lite and Groq. For the cheapest paid frontier-class model, DeepSeek is usually lowest. Prices change monthly — confirm on the provider’s pricing page. The durable strategy is to route bulk work to a cheap model and reserve a frontier model for the final answer only.

How much does it cost to run?

A portfolio-grade demo runs at roughly $0–$5/month: MongoDB Atlas free M0, a free LLM tier, embeddings at ~$0.02 per million tokens, and free deploy tiers on Vercel and Render. Set a provider spend cap and rate limits so a public demo can’t surprise you.

LangChain, LlamaIndex, or the Vercel AI SDK?

For a MERN streaming chat app, the Vercel AI SDK plus direct MongoDB vector queries is the lighter, recommended path in 2026. Reach for LlamaIndex.TS if your main challenge is heavy document ingestion, or LangChain.js/LangGraph for complex multi-agent orchestration. For ~90% of RAG web apps, the AI SDK is the right call.

Can I run this fully offline with a local model?

Yes. Ollama runs open models locally and exposes an OpenAI-compatible endpoint, so your code barely changes. Use a local embedding model like nomic-embed-text too. It’s ideal for development and privacy-sensitive data; for a low-traffic public demo, hosted free tiers are usually simpler and cheaper.

A note on accuracy. LLM pricing, model names, and SDK APIs change fast. Every figure here is a 2026 snapshot — verify current prices on each provider’s pricing page and current method names in the Vercel AI SDK docs before shipping to production. Code samples are illustrative walkthroughs, not drop-in files. Images: “Visualising AI” by Google DeepMind on Unsplash, free under the Unsplash License.

TL;DR

You can build and ship a production-shaped, agentic RAG “chat with your documents” app entirely on MERN for about $0/month: MongoDB Atlas’ free tier (with built-in vector search), a free LLM tier (Gemini Flash or Groq), embeddings at $0.02 per million tokens, and free deploys on Vercel + Render.
The 2026 stack is leaner than you think: MongoDB Atlas Vector Search (no separate vector database), the Vercel AI SDK for streaming + tool calling, and “agentic retrieval” — where the model itself decides when and what to search — instead of the old retrieve-once-then-answer pipeline.
Pick models by job, not by brand. Route cheap, high-volume work to Gemini Flash-Lite, DeepSeek, or Groq-hosted Llama; reserve a frontier model only for the final answer when quality matters. Self-hosting with Ollama only beats hosted APIs above heavy, sustained traffic.
By the end you’ll understand embeddings, chunking, vector retrieval, tool calling, token streaming in React, and how to put the whole thing on its own public website — with cost controls so you never get a surprise bill.
