Atul Tripathi
How to Build a "Chat with Website" App using Next.js, LangChain, and Cheerio 🦜🔗

Building RAG (Retrieval Augmented Generation) apps usually starts with PDFs. 📄
But let's be honest, users really want to chat with live URLs: documentation, wikis, and blogs. 🌐

I spent this weekend adding a Web Scraper to my RAG Starter Kit. Here is the technical breakdown of how I built it, so you can do it too. 👇

🛑 The Problem with Scraping for LLMs

You can't just fetch(url) and pass the HTML to GPT-4.

  1. Too much noise: Navbars, footers, and ads waste tokens. 💸
  2. Context window: Raw HTML is huge and confuses the model.
  3. Headless browsers: Tools like Puppeteer are heavy and often time out on serverless platforms like Vercel. ⏳

🛠 The Stack

  • Framework: Next.js 14
  • Scraper: Cheerio (via LangChain). It parses HTML like jQuery, making it lightweight and fast. ⚡️
  • Vector DB: Pinecone (Serverless).

Step 1: The Scraper Logic 🕷️

We use CheerioWebBaseLoader from LangChain. It grabs the raw HTML and lets us select only the body or specific content tags (like <article>), ignoring the junk.

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

export async function scrapeUrl(url) {
  // 1. Load the URL and parse it with Cheerio
  const loader = new CheerioWebBaseLoader(url, {
    selector: "p, h1, h2, h3, article", // 🎯 Only grab text content
  });

  const docs = await loader.load();
  return docs;
}
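
Calling it is a one-liner. Each returned item is a LangChain Document: the clean text lives in pageContent, and the scraped URL is carried in metadata. A quick sketch (the URL is just a placeholder):

const docs = await scrapeUrl("https://nextjs.org/docs");

console.log(docs[0].pageContent.slice(0, 200)); // First 200 chars of extracted text
console.log(docs[0].metadata.source); // The URL we scraped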

Step 2: The Cleaning (Smart Chunking) 🧹

LLMs need manageable chunks of text, and if you cut a sentence in half, you lose context. We use RecursiveCharacterTextSplitter, which tries to split on paragraph breaks first, then newlines, then spaces, so chunks end at natural boundaries.


import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // 📏 Characters per chunk (not tokens)
  chunkOverlap: 200, // 🔗 Overlap to preserve context across chunks
});

const splitDocs = await splitter.splitDocuments(docs);
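
Before paying for embeddings, it's worth a quick sanity check that the splitter did what you expect (a throwaway sketch):

console.log(`Split ${docs.length} scraped docs into ${splitDocs.length} chunks`);

// Each chunk is still a Document, with the source URL preserved in metadata
console.log(splitDocs[0].metadata.source);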

Step 3: The Cost Hack (1024 Dimensions) 💡

This is the most important part! 💰

By default, OpenAI's embedding models output 1536 dimensions. But Pinecone charges based on storage size, so smaller vectors mean a smaller bill.

OpenAI's new text-embedding-3-small lets you "shorten" the embeddings via the dimensions API parameter with minimal accuracy loss.

I configured my implementation to force 1024 dimensions:


import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  modelName: "text-embedding-3-small",
  dimensions: 1024, // 📉 Saves ~33% on storage costs
});
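
One gotcha: the Pinecone index itself must be created with a matching dimension of 1024, or upserts will be rejected. Here's a minimal sketch of the upsert step, assuming an index named fastrag-index and the @langchain/pinecone integration:

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

// "fastrag-index" is a placeholder; it must exist with dimension: 1024
const index = pinecone.index("fastrag-index");

const vectorStore = await PineconeStore.fromDocuments(splitDocs, embeddings, {
  pineconeIndex: index,
});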

✅ The Result

We now have a clean pipeline: URL ➡️ Clean Text ➡️ Chunks ➡️ Vectors ➡️ Chat.

This allows users to point the app at their documentation and ask questions immediately.
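
Under the hood, the chat step is just similarity search: embed the question, fetch the closest chunks, and hand them to the LLM as context. A rough sketch (the question is an example):

const retriever = vectorStore.asRetriever({ k: 4 });

// Pull the 4 chunks most similar to the user's question
const context = await retriever.invoke("How do I deploy this to Vercel?");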


🎁 Want the Full Source Code?

I cleaned up this entire logic (plus multi-file PDF support, a mobile UI, and streaming responses) and packaged it into a production-ready Starter Kit called FastRAG.

It saves you the ~40 hours of boilerplate setup so you can focus on building your AI SaaS over the holiday break. 🎅

🏁 I'm running a "Holiday Build" race:

🥇 First 69 devs: Get 69% OFF (~$9). Code: FAST69

🥈 Everyone else: Get 40% OFF. Code: HOLIDAY40

Check out the Live Demo & Repo here: 👉 FastRAG

Happy coding and happy holidays! 🎄🚀
