My portfolio site has project pages, work experience entries, and blog posts, all written as MDX files. When someone visits, they usually have a specific question: "Has this person worked with React?" or "What's their most recent project?" The answer is somewhere on the site, but finding it means clicking through pages and scanning project cards.
I wanted visitors to be able to just ask. Not a FAQ page with canned answers, but something that reads the actual content on the site and answers questions from it.
Why Not Just Feed It Everything?
Your first thought might be: take all the content, send it to a language model like GPT-4o or Claude, and let it answer questions. This works for short content. But language models hallucinate. Ask about a technology you never mentioned, and the model might confidently say "yes, they have 3 years of experience with that" because it sounds plausible.
There's also a scale problem. My site has around 30 content files. Sending all of them as context every time someone asks a question is wasteful, and the more content you include, the more room there is for the model to drift.
Search First, Then Answer
Instead of sending everything, what if I first searched my own content to find the pieces relevant to the question, and only sent those to the model? That's the core idea behind RAG (Retrieval-Augmented Generation). The model writes its answer from a small, focused slice of context instead of your entire site. Because it only sees what's relevant, it stays grounded in what's actually there.
To make this work, I needed three things: a way to split my content into searchable pieces, a way to search by meaning (not just keywords), and a language model to write the final answer.
Splitting Content Into Chunks
My content lives in MDX files: one per project, one per job, one per blog post. Some of these are long. A single project page might describe the tech stack, what I built, and how it works, all in one file. Sending an entire file as context when the user only asked about the tech stack wastes tokens and adds noise.
So I split each file into smaller chunks at paragraph boundaries, capped at 500 characters:
function chunkText(text: string, maxLen = 500): string[] {
  // Split on blank lines so chunks break at paragraph boundaries.
  const paragraphs = text.split(/\n\n+/);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // If adding this paragraph would push past the cap, close out the
    // current chunk and start a new one with this paragraph.
    if (current.length + para.length > maxLen && current) {
      chunks.push(current.trim());
      current = para;
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
One thing I learned through testing: raw chunks with no context confused the model. A chunk that says "Built with TypeScript and PostgreSQL" is meaningless without knowing whether it's describing a personal project or a company I worked at. The fix was adding type prefixes. Every chunk starts with [PROJECT], [WORK EXPERIENCE], [BLOG POST], or [PROFILE], so the AI immediately knows what kind of content it's looking at. I also added catalog chunks (complete lists of all projects or all work history) so questions like "list all my projects" don't return partial results.
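The prefixing and catalog steps can be sketched like this. The type labels match the ones in the text; the function names and chunk shapes are illustrative, not the actual implementation:

```typescript
// The four type labels used as chunk prefixes.
type ContentType = "PROJECT" | "WORK EXPERIENCE" | "BLOG POST" | "PROFILE";

// Prepend the type label and title so every chunk is self-describing.
function prefixChunk(type: ContentType, title: string, chunk: string): string {
  return `[${type}] ${title}\n${chunk}`;
}

// Catalog chunk: one chunk listing every item of a type, so questions like
// "list all my projects" retrieve a complete answer in a single hit.
function catalogChunk(type: ContentType, titles: string[]): string {
  return `[${type}] Complete list:\n${titles.map((t) => `- ${t}`).join("\n")}`;
}
```

With this, the "Built with TypeScript and PostgreSQL" chunk becomes `[PROJECT] Chatbot\nBuilt with TypeScript and PostgreSQL`, and the ambiguity is gone.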
Searching by Meaning
Now I have chunks, but how do I find which ones are relevant to a question? Keyword search is the obvious choice, but it's brittle. If someone asks about "React experience" and my project description says "built with NextJS", there's no keyword match, even though NextJS is a React framework.
This is where embeddings come in. An embedding model takes a piece of text and converts it into a list of numbers that represent its meaning. "React" and "NextJS" produce similar numbers because they're related concepts. "PostgreSQL" and "Redis" end up close together because they're both databases. When someone asks about "React experience", the question gets converted to numbers too, and it naturally lands close to anything frontend-related in my content.
To convert text into these numbers, you need an embedding model. My first attempt used the HuggingFace Inference API, which worked, but had a problem: 0.5 seconds when the model was warm, 9.4 seconds when it was cold. HuggingFace spins down free-tier models after inactivity, so the chatbot would randomly hang for nearly 10 seconds. I switched to running the same model locally: all-MiniLM-L6-v2 is a popular open-source option, only 22MB, and it produces 384 numbers per piece of text in about 12ms:
import { pipeline } from "@huggingface/transformers";

// Load the model once and reuse the extractor across calls.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embedText(text: string): Promise<number[]> {
  // Mean-pool the token embeddings into one vector, normalized to unit length.
  const result = await extractor(text, { pooling: "mean", normalize: true });
  return result.tolist()[0]; // 384 numbers
}
At build time, I run this on every chunk and save the results to a JSON file. At runtime, I embed the user's question and find the closest chunks by comparing their numbers using cosine similarity (how much two sets of numbers point in the same direction):
async function searchChunks(query: string, topK = 8) {
  const queryEmbedding = await embedText(query);
  return chunks
    .map((chunk) => ({
      ...chunk,
      // Higher score = closer in meaning to the query.
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
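`searchChunks` leans on a `cosineSimilarity` helper that isn't shown above. A minimal version looks like this (and since `embedText` normalizes its output to unit length, the denominator is effectively 1, so this reduces to a dot product):

```typescript
// Cosine similarity: the dot product of two vectors divided by the product
// of their lengths. Ranges from -1 (opposite) through 0 (unrelated) to 1
// (same direction, i.e. same meaning).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```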
If you're working with thousands of chunks, you'd want a vector database like Pinecone or Weaviate to handle the search. For a personal site with around 160 chunks, looping through all of them in memory works fine.
Generating the Answer
At this point I have the top 8 chunks most relevant to the user's question. The last step is sending them to a language model to write a readable answer.
I went with Groq's free tier running Llama 3.1 8B. The model doesn't know anything about me by default. It only sees whatever chunks I send it. The system prompt tells it how to interpret the content and what the type prefixes mean:
const SYSTEM_PROMPT = `You are a helpful assistant on a personal website.
Answer questions using only the provided context.
Pay attention to type labels:
- [PROJECT]: Portfolio projects
- [WORK EXPERIENCE]: Employment history
- [BLOG POST]: Articles written
- [PROFILE]: Personal info
Keep answers concise and friendly. Do not make up information.`;
The API call:
const response = await fetch("https://api.groq.com/openai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
  },
  body: JSON.stringify({
    model: "llama-3.1-8b-instant",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      ...conversationHistory,
      { role: "user", content: `Context:\n${relevantChunks}\n\nQuestion: ${question}` },
    ],
    temperature: 0.3,
  }),
});
Temperature controls how much randomness goes into the model's word choices. At a low setting like 0.3, it stays close to the most likely answer, which is what you want when accuracy matters. Conversation history (the last 10 messages) goes in with each request so follow-up questions like "tell me more about that project" work without losing context.
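The two pieces feeding that call, the context string and the trimmed history, can be sketched as follows. The function names are mine for illustration, not the actual code:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Join the retrieved chunks into one context block, separated so the
// model can tell where one chunk ends and the next begins.
function buildContext(chunks: { text: string; score: number }[]): string {
  return chunks.map((c) => c.text).join("\n\n---\n\n");
}

// Keep only the last N messages so the request stays small but follow-ups
// like "tell me more about that project" still have their referent.
function trimHistory(history: ChatMessage[], max = 10): ChatMessage[] {
  return history.slice(-max);
}
```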
Deploying to Vercel
At this point everything worked locally and I was ready to deploy and move on. The chatbot ran as a serverless function through Astro's Vercel adapter, the model was only 22MB, and the embeddings were a static JSON file. Should have been the easy part.
I deployed and immediately hit Vercel's 250MB size limit on serverless functions. The model is only 22MB, so that wasn't the issue. @huggingface/transformers depends on onnxruntime-node, which ships native binaries for every platform. They all get bundled into your function, and that alone pushes you way past 250MB.
There's a lighter alternative called onnxruntime-web that uses WebAssembly instead of native binaries, around 11MB. But it's built for browsers. Run it in Node.js and it tries to fetch WASM files from a CDN over HTTPS, which Node.js refuses to do.
The workaround: swap onnxruntime-node for onnxruntime-web with a pnpm override, copy the WASM files to a local directory during the build, and tell the runtime to load them from the filesystem instead of the CDN:
const wasmDir = join(process.cwd(), ".wasm");

onnxEnv.wasm.wasmPaths = {
  wasm: `file://${wasmDir}/ort-wasm-simd-threaded.wasm`,
  mjs: `file://${wasmDir}/ort-wasm-simd-threaded.mjs`,
};
onnxEnv.wasm.numThreads = 1;
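The pnpm side of that swap lives in package.json. As I understand pnpm's override syntax, aliasing one package to another looks roughly like this (the version shown is a placeholder, pin whatever matches your @huggingface/transformers release):

```json
{
  "pnpm": {
    "overrides": {
      "onnxruntime-node": "npm:onnxruntime-web@1.17.1"
    }
  }
}
```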
With Vercel's includeFiles bundling the model and WASM into the function, the same local inference that works on my laptop works in production. No embedding API, no cold starts, no cost.
What It Costs
- Embedding a query: ~50ms
- Searching 164 chunks: under 1ms
- LLM response: ~400ms
- Total: under 500ms
Monthly cost: $0. Groq's free tier covers the LLM, embeddings run inside the serverless function, and chunk data is a static JSON file built at deploy time.
The whole thing is around 250 lines of TypeScript. There's a chat button on my site if you want to try it.
Originally published on akrom.dev. For quick dev tips, join @akromdotdev on Telegram.