Source code: github.com/Dan1618/Articles-rag
1. What We're Building and Why
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Model responses by grounding them in external data. Instead of relying solely on the model's training data, RAG retrieves relevant information from your own curated sources and injects it into the prompt, producing answers that are more accurate and up-to-date.
In this project we build a system where you can save articles from the web into a vector database and then ask questions about their content through a chat-like interface. Think of it as assembling a personal knowledge base: every article you feed in becomes searchable context for future queries.
Could you just point a bot at a live website each time? Sure — but by persisting the data in a vector store you are building a knowledge base that grows over time. There is nothing stopping you from combining both approaches, or extending the pipeline to ingest PDFs and other document types as well.
The Stack
Backend framework - NestJS
Templating / UI - Handlebars
LLM orchestration - LangChain
Vector store - FAISS (local)
Embeddings & chat model - OpenAI API
LangChain is an open-source framework that simplifies building applications powered by language models. It provides ready-made abstractions for document loading, text splitting, embedding, vector storage, and chaining LLM calls together — so you can focus on your application logic rather than low-level plumbing.
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. We use it here as a local vector store, which is simpler to set up and demonstrate than a hosted vector database, while still being fast enough for production-grade similarity lookups.
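At its core, "similarity search over dense vectors" means ranking stored vectors by how close they are to a query vector. A minimal, dependency-free sketch of one common metric, cosine similarity (FAISS itself offers optimized index structures over L2 distance and inner product; this is only the conceptual idea):

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];      // accumulate dot product
    normA += a[i] * a[i];    // accumulate squared magnitudes
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Embeddings of semantically similar texts end up close in this space, which is what makes "find the chunks relevant to my question" a geometry problem.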
The UI
The user interface includes three input fields for links, a field for asking related questions, and buttons to save articles to FAISS or generate answers.
2. Technical Implementation
The application is split into two main services:
- IngestService — loads articles from the web, splits them into chunks, creates embeddings, and saves them to the FAISS index.
- AppService — loads the saved FAISS index, retrieves relevant chunks for a given question, and runs a Map-Reduce QA chain to produce an answer.
2.1 Setup — NestJS + Handlebars
NestJS gives us a structured, modular backend with dependency injection out of the box. Handlebars is wired in as the view engine so we can serve a lightweight chat-style UI without pulling in a full frontend framework. The two services above are standard NestJS @Injectable() providers.
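The Handlebars wiring typically lives in main.ts. A bootstrap sketch, assuming the standard NestExpressApplication setup with illustrative views/ and public/ paths (adjust to your project layout):

```typescript
// main.ts — bootstrap sketch (assumes @nestjs/platform-express and hbs are installed)
import { NestFactory } from '@nestjs/core';
import { NestExpressApplication } from '@nestjs/platform-express';
import { join } from 'path';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create<NestExpressApplication>(AppModule);
  app.useStaticAssets(join(__dirname, '..', 'public'));   // CSS/JS for the chat UI
  app.setBaseViewsDir(join(__dirname, '..', 'views'));    // where .hbs templates live
  app.setViewEngine('hbs');                               // register Handlebars
  await app.listen(3000);
}
bootstrap();
```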
2.2 Ingesting Articles
The IngestService.ingest() method handles the entire pipeline from raw URL to searchable vector store:
// Import paths may differ slightly depending on your LangChain version
import * as fs from 'fs';
import * as path from 'path';
import { Injectable } from '@nestjs/common';
import { OpenAIEmbeddings } from '@langchain/openai';
import { CheerioWebBaseLoader } from '@langchain/community/document_loaders/web/cheerio';
import { FaissStore } from '@langchain/community/vectorstores/faiss';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

@Injectable()
export class IngestService {
  private readonly directory = 'faiss_store';
  private readonly urlsFile = path.join(this.directory, 'urls.json');

  async ingest(urls: string[]) {
    console.log('Building new FAISS store...');
    const embeddings = new OpenAIEmbeddings();

    // 1. Load data from each URL using Cheerio
    const docs = [];
    for (const url of urls) {
      const loader = new CheerioWebBaseLoader(url);
      const loadedDocs = await loader.load();
      docs.push(...loadedDocs);
    }

    // 2. Split documents into chunks
    const textSplitter = new RecursiveCharacterTextSplitter({
      separators: ['\n\n', '\n', '.', ','],
      chunkSize: 1000,
    });
    const splitDocs = await textSplitter.splitDocuments(docs);

    // 3. Create embeddings and persist the FAISS index
    const vectorStore = await FaissStore.fromDocuments(splitDocs, embeddings);
    await vectorStore.save(this.directory);
    fs.writeFileSync(this.urlsFile, JSON.stringify(urls));
  }
}
Let's walk through the key stages.
Loading — CheerioWebBaseLoader
CheerioWebBaseLoader is a LangChain document loader that fetches a web page and extracts its text content using the Cheerio HTML parser. Each URL becomes a Document object containing the page text and metadata.
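Conceptually the loader does two things: fetch the HTML, then boil it down to plain text. A dependency-free sketch of that idea, using a naive tag-stripper purely for illustration (the real CheerioWebBaseLoader uses Cheerio's proper HTML parsing and selectors, and is far more robust):

```typescript
// Shape of a LangChain-style document: page text plus source metadata
type Doc = { pageContent: string; metadata: { source: string } };

// Naive "HTML → text": drop scripts/styles, strip tags, collapse whitespace.
function htmlToText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Fetch a page and wrap it as a document (Node 18+ has global fetch)
async function loadUrl(url: string): Promise<Doc> {
  const res = await fetch(url);
  const html = await res.text();
  return { pageContent: htmlToText(html), metadata: { source: url } };
}
```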
Splitting — Why Chunks Matter
LLMs have a finite context window — the maximum number of tokens a model can process in a single request. If a model has a 4,000-token limit and your prompt already uses 3,500 tokens, only 500 tokens are left for the completion. A full article easily exceeds that budget, so we need to split it into smaller chunks that fit comfortably inside the context window.
RecursiveCharacterTextSplitter handles this by trying a hierarchy of separators (\n\n, \n, ., ,) to find natural break points. We set chunkSize: 1000 to keep each chunk under 1,000 characters.
Overlap Chunks
RecursiveCharacterTextSplitter also supports an overlap option (chunkOverlap), which allows adjacent chunks to share a portion of text at their boundaries. Think of it like the "previously on…" recap at the start of a TV episode followed by the "coming up next" teaser at the end — it ensures that context isn't lost at the seams between chunks.
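The interplay of chunkSize and chunkOverlap can be sketched with a simple sliding window (the real RecursiveCharacterTextSplitter additionally prefers natural break points via its separator hierarchy; this stripped-down version only shows the overlap mechanics):

```typescript
// Minimal sliding-window chunker: each chunk is at most `size` characters,
// and consecutive chunks share `overlap` characters at their boundary.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const chunks: string[] = [];
  const step = size - overlap; // advance by size minus the shared "recap"
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

For example, chunkWithOverlap('abcdefghij', 4, 2) yields ['abcd', 'cdef', 'efgh', 'ghij']: each chunk repeats the last two characters of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.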
Embedding & Saving — FaissStore.fromDocuments
FaissStore.fromDocuments(splitDocs, embeddings) sends each chunk to the OpenAI Embeddings API, converts the text into high-dimensional vectors, and indexes them in a FAISS store. The resulting index is then saved to disk with vectorStore.save(), so it can be reloaded later without re-embedding.
It is worth highlighting that vectorizing the data requires calls to the OpenAI API, which are not free (although pricing for small models such as gpt-4o-mini is quite low). If you want to run this application, remember to put your OpenAI API key in the .env file. You can watch the OpenAI connection logs in the console; a similar round-trip happens when retrieving data from the FAISS index to answer questions.
2.3 Answering Questions
Once the articles are ingested, the AppService.answerQuestion() method handles the retrieval and answering:
async answerQuestion(question: string) {
  const embeddings = new OpenAIEmbeddings();
  const vectorStore = await FaissStore.load(this.directory, embeddings);
  const llm = new ChatOpenAI({
    modelName: 'gpt-4o-mini',
    temperature: 0.7,
    maxTokens: 1000,
  });
  const chain = loadQAChain(llm, { type: 'map_reduce' });

  // Retrieve the most relevant chunks
  const retriever = vectorStore.asRetriever();
  const retrievedDocs = await retriever.invoke(question);

  // Run the QA chain
  const result = await chain.invoke({
    input_documents: retrievedDocs,
    question: question,
  });

  // Get the sources
  const sources = Array.from(new Set(retrievedDocs.map(doc => doc.metadata.source)));
  return { status: 'done', answer: result.text, sources };
}
Step by step:
- Load the vector store — FaissStore.load() reads the previously saved index from disk.
- Create the LLM — We use gpt-4o-mini with a temperature of 0.7 for a good balance of creativity and accuracy.
- Retrieve relevant chunks — vectorStore.asRetriever() returns a retriever that performs a similarity search. When we call retriever.invoke(question), it embeds the question and finds the most similar chunks in the FAISS index.
- Run the Map-Reduce chain — The retrieved documents and the question are passed into the chain, which produces the final answer.
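What the retrieval step does conceptually: embed the question, score every stored chunk against it, and keep the top-k. A dependency-free sketch with inner-product scoring and pre-computed vectors standing in for the OpenAI embedding call (FAISS does this with optimized index structures rather than a brute-force sort):

```typescript
type Chunk = { text: string; vector: number[]; source: string };

// Inner product between two vectors — one of FAISS's supported metrics.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Rank chunks by similarity to the query vector and keep the k best.
function retrieve(queryVec: number[], chunks: Chunk[], k = 4): Chunk[] {
  return [...chunks]
    .sort((a, b) => dot(b.vector, queryVec) - dot(a.vector, queryVec))
    .slice(0, k);
}
```

The default retriever returns the top 4 chunks; those are exactly the input_documents handed to the QA chain.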
2.4 Map-Reduce: Reassembling the Chunks
When we split an article into chunks for ingestion, we eventually need a strategy to recombine those chunks when answering a question. This is where the Map-Reduce pattern comes in.
In the context of LLM applications, Map-Reduce operates in two phases:
- Map — Each retrieved chunk is sent to the LLM individually. The model extracts or summarizes only the information relevant to the question, producing a filtered chunk (FC). This step runs in parallel, and its primary goal is to reduce the size of each chunk down to the essential content.
- Reduce — All the filtered chunks are combined into a single summary, which is then sent to the LLM along with the original question to produce the final answer.
LangChain's loadQAChain with type: 'map_reduce' wires this up for you. Under the hood it uses two sub-chains:
- An LLM chain that processes each individual document (the Map step).
- A combine documents chain that merges the Map outputs into one cohesive input for the final LLM call (the Reduce step).
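The two phases can be sketched with a stubbed summarize function standing in for the LLM call (in the real chain each call is a prompt to gpt-4o-mini; the function signature here is purely illustrative):

```typescript
// Stand-in for an LLM call: "condense this text with respect to the question".
type Summarize = (text: string, question: string) => Promise<string>;

async function mapReduceQA(
  chunks: string[],
  question: string,
  summarize: Summarize,
): Promise<string> {
  // Map: filter each chunk down to question-relevant content, in parallel.
  const filtered = await Promise.all(chunks.map(c => summarize(c, question)));
  // Reduce: combine the filtered chunks and ask for the final answer.
  return summarize(filtered.join('\n\n'), question);
}
```

The Map calls are independent, which is why they can run in parallel; only the Reduce call needs to see all the filtered chunks at once.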
The "Stuff" Optimization
Although in this implementation we set map_reduce explicitly, LangChain includes an internal optimization: if the total retrieved text (all chunks combined) is smaller than the LLM's context window or a pre-defined token_max limit, the chain detects that it is cheaper and faster to skip the Map phase entirely.
Instead of performing multiple Map calls followed by one Reduce call, it simply "stuffs" all the documents into a single prompt and makes one LLM call. This automatic fallback saves both time and API cost when the input is small enough to fit.
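The decision boils down to a token-budget check, sketched here with a rough characters-per-token estimate (the ~4-chars-per-token heuristic and the tokenMax value are illustrative; LangChain counts tokens with a real tokenizer):

```typescript
// Rough heuristic: ~4 characters per token for English text (illustrative only).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// If everything fits in one prompt, "stuff" it; otherwise fall back to map-reduce.
function chooseStrategy(chunks: string[], tokenMax: number): 'stuff' | 'map_reduce' {
  const total = chunks.reduce((sum, c) => sum + estimateTokens(c), 0);
  return total <= tokenMax ? 'stuff' : 'map_reduce';
}
```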
3. Summary and Ideas for the Future
We've built a RAG pipeline that ingests articles from the web, stores them in a local FAISS vector store, and answers questions using a Map-Reduce QA chain powered by OpenAI. The key takeaways:
- RAG grounds LLM responses in your own data, making answers more accurate and verifiable.
- Chunking with overlap preserves context across split boundaries.
- Map-Reduce elegantly handles cases where retrieved content exceeds the context window, with an automatic "Stuff" fallback for smaller inputs.
- FAISS provides a zero-infrastructure vector store that is perfect for demos and small-to-medium workloads.
Where to Go from Here
- Add PDF and file ingestion — extend the loader to support PDFLoader, TextLoader, and other LangChain document loaders for a richer knowledge base.
- Persistent hosted vector store — migrate from local FAISS to a managed solution like Pinecone, Weaviate, or Qdrant for multi-user, production-grade deployments.
- Streaming responses — use LangChain's streaming callbacks to deliver answers token-by-token for a more responsive chat experience.
- Hybrid retrieval — combine vector similarity search with keyword-based (BM25) retrieval for better recall on exact-match queries.



