Every website has docs, FAQs, and knowledge buried across dozens of pages. I wanted to turn all of that into an AI chatbot that actually knows the website — not a generic GPT wrapper, but something grounded in real content.
So I built a RAG (Retrieval-Augmented Generation) pipeline: scrape a website, chunk the text, generate vector embeddings, store them in Postgres with pgvector, and retrieve relevant context at query time. Here's how the whole thing works under the hood.
The Architecture
The stack:
- Next.js 15 (App Router) — frontend + API routes
- Supabase — Postgres database with pgvector extension
- OpenAI — embeddings (text-embedding-3-small) and chat completions (gpt-4o-mini)
- Cheerio — server-side HTML parsing for the scraper
The flow looks like this:
User enters URL → Scrape website → Chunk text → Generate embeddings → Store in pgvector
↓
Visitor asks question → Embed query → Vector similarity search → Build context → LLM generates answer
Step 1: The Web Scraper
The scraper starts at the user's URL and crawls internal links using a BFS approach, up to 20 pages. Cheerio handles HTML parsing — no headless browser needed.
const MAX_PAGES = 20; // crawl cap mentioned above

export async function scrapeWebsite(url: string): Promise<ScrapedPage[]> {
const origin = new URL(url).origin;
const visited = new Set<string>();
const queue: string[] = [normalizeUrl(url)];
const results: ScrapedPage[] = [];
while (queue.length > 0 && results.length < MAX_PAGES) {
const current = queue.shift()!;
if (visited.has(current)) continue;
visited.add(current);
const page = await fetchAndParse(current);
if (!page) continue;
results.push({ url: current, title: page.title, content: page.content });
for (const link of page.links) {
const resolved = resolveLink(link, current, origin);
if (resolved && !visited.has(resolved)) queue.push(resolved);
}
}
return results;
}
A few important decisions here:
- Same-origin only — we skip external links to avoid scraping the entire internet
- Content extraction — we strip <script>, <style>, <nav>, <footer>, and [aria-hidden] elements, then prefer <main> or <article> content over the full <body>
- URL normalization — trailing slashes, tracking params (utm_*), and hashes are stripped to avoid duplicate pages
- Minimum content threshold — pages with fewer than 50 characters of extracted text are skipped (they're usually empty shells)
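The `normalizeUrl` and `resolveLink` helpers referenced in the scraper aren't shown above. A minimal sketch of how they might look, given the decisions listed here (the names match the scraper, but the exact implementation is illustrative):

```typescript
// Strip hashes, utm_* tracking params, and trailing slashes so the
// same page is never queued twice under different URLs.
export function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = '';
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith('utm_')) u.searchParams.delete(key);
  }
  let href = u.toString();
  if (href.endsWith('/')) href = href.slice(0, -1);
  return href;
}

// Resolve a possibly-relative link against the current page and keep
// it only if it stays on the same origin.
export function resolveLink(
  link: string,
  base: string,
  origin: string
): string | null {
  try {
    const resolved = new URL(link, base);
    if (resolved.origin !== origin) return null; // same-origin only
    if (!resolved.protocol.startsWith('http')) return null;
    return normalizeUrl(resolved.toString());
  } catch {
    return null; // malformed href
  }
}
```

Wrapping the resolution in try/catch matters more than it looks: real pages are full of `javascript:`, `mailto:`, and malformed hrefs that would otherwise crash the crawl.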
Step 2: Chunking for Embeddings
You can't just throw an entire page into an embedding model. The text needs to be split into chunks that are small enough to be meaningful but large enough to carry context.
const CHARS_PER_TOKEN = 4; // rough average for English text

export function chunkText(
text: string,
maxTokens: number = 500,
overlap: number = 50
): string[] {
const maxChars = maxTokens * CHARS_PER_TOKEN; // ~4 chars per token
const overlapChars = overlap * CHARS_PER_TOKEN;
const paragraphs = text.split(/\n\s*\n/);
const chunks: string[] = [];
let currentChunk = '';
for (const para of paragraphs) {
if (currentChunk.length + para.length > maxChars && currentChunk.length > 0) {
chunks.push(currentChunk.trim());
// Overlap: carry the tail of the previous chunk into the next
currentChunk = currentChunk.slice(-overlapChars) + ' ' + para;
} else {
currentChunk += (currentChunk ? '\n\n' : '') + para;
}
}
if (currentChunk.trim()) chunks.push(currentChunk.trim());
return chunks.filter(c => c.length > 20);
}
The overlap is key. Without it, a relevant fact that spans a chunk boundary is lost entirely. A 50-token overlap ensures continuity. We split on paragraph boundaries first (natural breakpoints), then fall back to sentence splitting for oversized paragraphs (omitted from the snippet above).
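For reference, the sentence-level fallback for oversized paragraphs can look something like this (a sketch; a production splitter would handle abbreviations and other edge cases more carefully):

```typescript
// If a single paragraph exceeds the chunk budget, split it on
// sentence boundaries instead of cutting mid-word.
export function splitOversizedParagraph(
  para: string,
  maxChars: number
): string[] {
  if (para.length <= maxChars) return [para];
  // Match runs of text ending in ., !, or ? (plus a trailing fragment).
  const sentences = para.match(/[^.!?]+[.!?]+(\s|$)|[^.!?]+$/g) ?? [para];
  const pieces: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      pieces.push(current.trim());
      current = '';
    }
    current += sentence;
  }
  if (current.trim()) pieces.push(current.trim());
  return pieces;
}
```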
Step 3: Generating and Storing Embeddings
We use OpenAI's text-embedding-3-small model (1536 dimensions). Texts are batched to stay within token limits:
const EMBEDDING_MODEL = 'text-embedding-3-small';
const EMBEDDING_DIMENSIONS = 1536;
export async function generateEmbeddings(texts: string[]): Promise<number[][]> {
const batches = createBatches(texts, MAX_TOKENS_PER_BATCH);
const allEmbeddings: number[][] = [];
for (const batch of batches) {
const response = await getOpenAI().embeddings.create({
model: EMBEDDING_MODEL,
dimensions: EMBEDDING_DIMENSIONS,
input: batch,
});
const sorted = response.data.sort((a, b) => a.index - b.index);
for (const item of sorted) allEmbeddings.push(item.embedding);
}
return allEmbeddings;
}
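The `createBatches` helper isn't shown above; here's a sketch assuming a character-based approximation of the token budget (the same ~4 chars/token heuristic the chunker uses). A real implementation might use a tokenizer for exact counts.

```typescript
const CHARS_PER_TOKEN = 4; // rough heuristic, same as the chunker

// Group texts so each embeddings request stays under a rough
// per-request token budget.
export function createBatches(
  texts: string[],
  maxTokensPerBatch: number
): string[][] {
  const maxChars = maxTokensPerBatch * CHARS_PER_TOKEN;
  const batches: string[][] = [];
  let batch: string[] = [];
  let batchChars = 0;
  for (const text of texts) {
    if (batch.length > 0 && batchChars + text.length > maxChars) {
      batches.push(batch);
      batch = [];
      batchChars = 0;
    }
    batch.push(text);
    batchChars += text.length;
  }
  if (batch.length > 0) batches.push(batch);
  return batches;
}
```

Note the sort by `index` in `generateEmbeddings`: the API documents that embeddings come back with an index into the input array, and keeping them aligned with the original chunk order is what lets us store them against the right documents.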
On the Supabase side, we store documents with their embeddings and use a match_documents RPC function that performs cosine similarity search via pgvector:
CREATE OR REPLACE FUNCTION match_documents(
query_embedding vector(1536),
match_chatbot_id uuid,
match_count int DEFAULT 5
) RETURNS TABLE (
id uuid, content text, title text, source_url text, similarity float
) AS $$
SELECT id, content, title, source_url,
1 - (embedding <=> query_embedding) AS similarity
FROM documents
WHERE chatbot_id = match_chatbot_id
ORDER BY embedding <=> query_embedding
LIMIT match_count;
$$ LANGUAGE sql;
The <=> operator is pgvector's cosine distance. We return 1 - distance as similarity so higher = more relevant.
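On the application side, the function is invoked through supabase-js's `rpc` method. A sketch (the argument names mirror the SQL signature above; the client type is simplified to just what we need here):

```typescript
// Minimal shape of the client we use; in the app this is the
// Supabase admin client from '@supabase/supabase-js'.
type RpcClient = {
  rpc: (fn: string, args: Record<string, unknown>) =>
    Promise<{ data: any; error: any }>;
};

// Retrieve the top-k chunks for a query embedding via the
// match_documents function defined above.
export async function retrieveContext(
  admin: RpcClient,
  chatbotId: string,
  queryEmbedding: number[],
  k = 5
) {
  const { data, error } = await admin.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_chatbot_id: chatbotId,
    match_count: k,
  });
  if (error) throw error;
  // rows: [{ id, content, title, source_url, similarity }, ...]
  return data;
}
```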
Step 4: The Chat Endpoint
This is where it all comes together. When a visitor asks a question, we need to:
- Embed their question
- Find the most relevant chunks
- Build a prompt with that context
- Stream the LLM's response
The critical insight was parallelizing the first phase:
const parallelResults = await Promise.all([
// Verify chatbot exists and is active
admin.from('chatbots').select('...').eq('id', chatbot_id).single(),
// Generate query embedding (~970ms)
generateEmbeddings([trimmedMessage]),
// Fetch conversation history
conversation_id
? admin.from('messages').select('role, content')
.eq('conversation_id', conversation_id).limit(10)
: Promise.resolve({ data: null, error: null }),
]);
The embedding call takes ~970ms. The database queries take ~50ms each. By running them in parallel instead of sequentially, we save almost a full second on every request.
The system prompt keeps the model grounded:
function buildSystemPrompt(chatbotName: string, context: string): string {
return `You are "${chatbotName}", a helpful AI assistant that answers
questions based on the website content provided below.
INSTRUCTIONS:
- Answer questions accurately using ONLY the provided context.
- If the context does not contain enough information, say so honestly.
- Do not make up information that is not in the context.
CONTEXT FROM WEBSITE:
${context || 'No relevant content found.'}`;
}
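The `context` argument is assembled from the matched chunks. One plausible way to build it, labelling each chunk with its source page so the model can point back to where an answer came from (the exact format here is illustrative):

```typescript
type MatchedChunk = {
  content: string;
  title: string;
  source_url: string;
  similarity: number;
};

// Concatenate retrieved chunks into one context block, labelling
// each with its source page.
export function buildContext(chunks: MatchedChunk[]): string {
  return chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.source_url})\n${c.content}`)
    .join('\n\n---\n\n');
}
```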
Performance: From 9.5s to 3s
The first version was painfully slow. Here's what the timing breakdown looked like:
| Phase | Before | After |
|---|---|---|
| Embedding | 970ms | 970ms (parallel) |
| DB lookups | 150ms | 50ms (parallel) |
| Vector search | 120ms | 120ms |
| LLM completion | 2-4s | 2-4s (streaming) |
| DB writes | 500ms | 0ms (deferred) |
| Total (perceived) | ~9.5s | ~3s |
Three optimizations made the difference:
1. Parallel I/O
Instead of sequential await calls, we use Promise.all() for independent operations. Embedding generation, chatbot verification, and conversation history all run simultaneously.
2. Streaming Responses
We switched from waiting for the full completion to Server-Sent Events (SSE) streaming. The user sees the first token within ~1s, even though the full response takes 3-4s to generate.
const stream = new ReadableStream({
async start(controller) {
const openaiStream = await getOpenAI().chat.completions.create({
model: chatbot.model || 'gpt-4o-mini',
messages: openaiMessages,
stream: true,
});
for await (const chunk of openaiStream) {
const content = chunk.choices[0]?.delta?.content || '';
if (content) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ content })}\n\n`)
);
}
}
controller.close();
},
});
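On the client side, the stream is read frame by frame. A sketch of a reader, assuming a `/api/chat` route (hypothetical path for illustration), with the frame parsing split out as a pure function:

```typescript
// Split an SSE buffer into completed content tokens plus a trailing
// partial frame that should be kept for the next read.
export function drainSseBuffer(buffer: string): {
  tokens: string[];
  rest: string;
} {
  const frames = buffer.split('\n\n'); // SSE frames end in a blank line
  const rest = frames.pop() ?? '';
  const tokens: string[] = [];
  for (const frame of frames) {
    if (!frame.startsWith('data: ')) continue;
    const { content } = JSON.parse(frame.slice(6));
    if (content) tokens.push(content);
  }
  return { tokens, rest };
}

// Read the chat endpoint's SSE stream and hand each token to onToken.
export async function streamChat(
  body: object,
  onToken: (text: string) => void
): Promise<void> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { tokens, rest } = drainSseBuffer(buffer);
    buffer = rest;
    for (const t of tokens) onToken(t);
  }
}
```

Buffering matters because a network read can end mid-frame; the partial frame is carried over rather than parsed as broken JSON.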
3. Deferred Database Writes
All database writes (message storage, usage stats, performance metrics) happen after the response is sent to the user, using Next.js's after() API:
after(async () => {
await deferDbWrites({ admin, activeConversationId, ... });
});
This is critical on Vercel's serverless platform. The user gets their answer immediately; bookkeeping happens in the background.
Lessons Learned
1. Chunk overlap matters more than chunk size. I experimented with 200, 500, and 1000 token chunks. 500 with 50-token overlap gave the best retrieval quality. Too small and you lose context; too large and the embeddings become too generic.
2. pgvector is surprisingly fast. For datasets under 100K vectors, cosine similarity search over pgvector in Supabase completes in under 200ms without any indexing tricks. You don't need Pinecone or Weaviate for most use cases.
3. Fire-and-forget doesn't work on serverless. My first attempt used void saveMetrics() — a fire-and-forget pattern. On Vercel, the function shuts down the moment the response is sent, killing any pending promises. Next.js's after() solves this by extending the function's lifetime for background work.
4. Scraping is the hardest part. Not the AI, not the vector math — scraping. Every website has different HTML structure, JavaScript-rendered content, rate limits, and anti-bot measures. Cheerio handles static HTML well, but SPAs need a different approach.
5. The 'context window' UX problem. When the LLM can't find relevant context, it should say "I don't know" — but users find that frustrating. The trick is to acknowledge what they asked and suggest what the chatbot can help with, based on the available document titles.
The Full Training Pipeline
Putting it all together, the training flow is:
- User provides a URL
- Scraper crawls up to 20 pages (BFS, same-origin)
- Content is chunked (500 tokens, 50 overlap)
- Embeddings generated in batches via OpenAI
- Documents + embeddings stored in Supabase (pgvector)
- Chatbot status set to 'active'
The whole process takes 15-45 seconds depending on the website size.
At query time:
- Visitor's question is embedded
- pgvector finds the top 5 most similar chunks
- Chunks are injected into the system prompt as context
- gpt-4o-mini generates a grounded response
- Response is streamed back via SSE
If you want to see this in action, I built it into DocuChat — you can paste any website URL and have a working AI chatbot in under a minute.
The full codebase uses Next.js 15 App Router, Supabase with pgvector, and OpenAI's API. If you're building something similar, the key takeaway is: parallelize everything, stream responses, and defer writes. Your users will thank you.