Building RAG (Retrieval-Augmented Generation) apps usually starts with PDFs.
But let's be honest, users really want to chat with live URLs: documentation, wikis, and blogs.
I spent this weekend adding a Web Scraper to my RAG Starter Kit. Here is the technical breakdown of how I built it, so you can do it too.
The Problem with Scraping for LLMs
You can't just fetch(url) and pass the raw HTML to GPT-4:
- Too much noise: Navbars, footers, and ads waste tokens.
- Context Window: Raw HTML is huge and confuses the model.
- Headless Browsers: Tools like Puppeteer are heavy and often time out on serverless functions (like Vercel).
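To see the scale of the problem, a quick check of what a naive fetch returns (the URL is illustrative):

// Naive approach: raw HTML is mostly markup, scripts, and nav chrome
const res = await fetch("https://nextjs.org/docs");
const html = await res.text();
console.log(html.length); // typically many times larger than the visible article text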
The Stack
- Framework: Next.js 14
- Scraper: Cheerio (via LangChain). It parses HTML like jQuery, making it lightweight and fast.
- Vector DB: Pinecone (Serverless).
Step 1: The Scraper Logic
We use CheerioWebBaseLoader from LangChain. It grabs the raw HTML and lets us select only the body or specific content tags (like <article>), ignoring the junk.
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

export async function scrapeUrl(url) {
  // 1. Load the URL, grabbing only text-bearing tags
  const loader = new CheerioWebBaseLoader(url, {
    selector: "p, h1, h2, h3, article", // Only grab text content
  });
  const docs = await loader.load();
  return docs;
}
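A quick usage sketch (the URL is just an example):

// Hypothetical usage: scrape a docs page and inspect the result
const docs = await scrapeUrl("https://nextjs.org/docs");
console.log(docs[0].pageContent.slice(0, 200)); // first 200 chars of extracted text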
Step 2: The Cleaning (Smart Chunking)
LLMs need manageable chunks of text, and cutting a sentence in half loses context. We use RecursiveCharacterTextSplitter, which tries to split on paragraph and line breaks before falling back to words and raw characters.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Characters per chunk (not tokens, by default)
  chunkOverlap: 200, // Overlap to preserve context across chunks
});

const splitDocs = await splitter.splitDocuments(docs);
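A quick sanity check, reusing the docs from Step 1:

// Each chunk is a Document whose pageContent is at most ~1000 characters
console.log(`Split ${docs.length} pages into ${splitDocs.length} chunks`);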
Step 3: The Cost Hack (1024 Dimensions)
This is the most important part!
By default, OpenAI's embedding models output 1536-dimension vectors, and Pinecone charges based on how much you store.
OpenAI's new text-embedding-3-small allows you to "shorten" the dimensions with minimal accuracy loss.
I configured my implementation to force 1024 dimensions:
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024, // Saves ~33% on vector storage vs. the default 1536
});
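One gotcha worth flagging: the Pinecone index itself must be created with a matching dimension (1024 here), or upserts will be rejected. A minimal sketch of the indexing step using LangChain's PineconeStore, with a hypothetical index name:

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";

const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
// "fastrag-index" is a placeholder; create the index with dimension: 1024
const pineconeIndex = pinecone.index("fastrag-index");

await PineconeStore.fromDocuments(splitDocs, embeddings, { pineconeIndex });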
The Result
We now have a clean pipeline: URL → Clean Text → Chunks → Vectors → Chat.
This allows users to point the app at their documentation and ask questions immediately.
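On the chat side, answering a question starts with a similarity search over that same index. A sketch of the retrieval step, reusing the embeddings and index from Step 3 (the question is illustrative):

const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
});

// Pull the 4 chunks most relevant to the user's question
const results = await vectorStore.similaritySearch("How do I configure routing?", 4);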
Want the Full Source Code?
I cleaned up this entire logic (plus Multi-File PDF support, Mobile UI, and Streaming responses) and packaged it into a production-ready Starter Kit called FastRAG.
It saves you the ~40 hours of boilerplate setup so you can focus on building your AI SaaS over the holiday break.
I'm running a "Holiday Build" race:
- First 69 devs: Get 69% OFF (~$9). Code: FAST69
- Everyone else: Get 40% OFF. Code: HOLIDAY40
Check out the Live Demo & Repo here: FastRAG
Happy coding and happy holidays!