LangChain.js: Chatting with a PDF

#webdev #ai #beginners #tutorial

Before reading this article, please go through the fundamentals of LangChain here; they are a prerequisite for understanding the code below.

LangChain also ships a JavaScript library, which lets you build LLM-powered applications in much the same way as in Python. Below, we walk through the steps of creating an LLM-powered app with LangChain.js, JavaScript, and Gemini-Pro.

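Prerequisites:

Install LangChain: npm install -S langchain
Google API key
LangChain community module: npm install @langchain/community
LangChain Google module: npm install @langchain/google-genai
dotenv (used in Step 2 to load the API key): npm install dotenv
pdf-parse (peer dependency of PDFLoader): npm install pdf-parse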
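
Since the embeddings code in Step 2 reads the key from process.env.google_api_key, it can help to fail fast when the key is missing. Here is a minimal sketch, assuming the key is stored under that name; requireEnv is a hypothetical helper for illustration, not part of LangChain or dotenv:

```javascript
// Hypothetical helper: return an environment variable, or throw
// a descriptive error if it is missing or empty.
function requireEnv(name, env = process.env) {
    const value = env[name];
    if (!value) {
        throw new Error(`Missing environment variable: ${name}`);
    }
    return value;
}

// Example with an explicit object standing in for process.env
const apiKey = requireEnv("google_api_key", { google_api_key: "demo-key" });
console.log(apiKey.length > 0);
```

In a real project you would call requireEnv("google_api_key") after dotenv.config() has run, so a missing key surfaces at startup rather than on the first API request.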

Step 1: Loading and Splitting the Data

The first step is to load the source document, in our case a PDF, and split its text into smaller chunks that the LLM can process easily.

import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

export async function loadAndSplitChunks({
    chunkSize,
    chunkOverlap
}) {
    // Initialize the PDF loader
    const loader = new PDFLoader("./files/drylab.pdf");

    // Load the PDF; by default this yields one document per page
    const rawDocs = await loader.load();

    // Split recursively on a list of separators (paragraphs, then
    // lines, then words) until each chunk fits within chunkSize
    const splitter = new RecursiveCharacterTextSplitter({
        chunkSize,
        chunkOverlap,
    });

    // Split the documents into chunks
    const splitDocs = await splitter.splitDocuments(rawDocs);
    return splitDocs;
}

// Calling the function
// to load the pdf file and split the document
const splitDocs = await loadAndSplitChunks({
    chunkSize: 1536,
    chunkOverlap: 128,
});

Technical Terms:

Chunk Size: The maximum length of each individual chunk, measured in characters by default.
Chunk Overlap: The amount of text shared between consecutive chunks. Overlap ensures that important context is not lost at chunk boundaries when a long text is split into smaller pieces.
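
To make the two terms concrete, here is a simplified sketch of a fixed-size chunker with overlap. It is an illustration only and not the RecursiveCharacterTextSplitter algorithm, which also tries to break on natural separators; chunkWithOverlap is a hypothetical name:

```javascript
// Simplified sliding-window chunker (illustration only).
// Each chunk starts (chunkSize - chunkOverlap) characters after
// the previous one, so neighbours share chunkOverlap characters.
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
    if (chunkOverlap >= chunkSize) {
        throw new Error("chunkOverlap must be smaller than chunkSize");
    }
    const step = chunkSize - chunkOverlap;
    const chunks = [];
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        // Stop once the current chunk reaches the end of the text
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}

const chunks = chunkWithOverlap("abcdefghijklmnopqrstuvwxyz", 10, 3);
console.log(chunks);
// [ 'abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz' ]
```

Notice that each chunk begins with the last three characters of the previous one: that shared region is the overlap.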

Step 2: Initialize and Load the Vector Store

Now that we have split the document into chunks, we convert them into embeddings and store them in an in-memory vector store.

import { TaskType } from "@google/generative-ai";
import { GoogleGenerativeAIEmbeddings } from "@langchain/google-genai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import dotenv from "dotenv";
dotenv.config();

export async function initializeVectorstoreWithDocuments({
    documents
}) {
    // Initialize your integration's embeddings
    const embeddings = new GoogleGenerativeAIEmbeddings({
        modelName: "embedding-001", // 768 dimensions
        taskType: TaskType.RETRIEVAL_DOCUMENT,
        title: "Retrieval Document",
        apiKey: process.env.google_api_key
    });

    // initialize your vector store
    const vectorstore = new MemoryVectorStore(embeddings);
    // add the chunks to the vector store
    await vectorstore.addDocuments(documents);
    return vectorstore;
}

//pass the chunks of data to the vector store
const vectorstore = await initializeVectorstoreWithDocuments({
    documents: splitDocs,
});

Technical Terms:

Embeddings: Numerical representations of words, sentences, or documents that capture their semantic meaning.
Memory Vector Store: An in-memory vector store that keeps embeddings in memory and performs an exact, linear search for the most similar embeddings.
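
The idea of an exact, linear search can be sketched in a few lines. This example assumes cosine similarity as the metric and uses made-up three-dimensional vectors in place of real 768-dimensional embeddings; linearSearch and the sample entries are hypothetical:

```javascript
// Cosine similarity between two equal-length vectors
function cosineSimilarity(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored entry against the query (linear scan),
// then keep the k highest-scoring entries.
function linearSearch(queryVector, entries, k) {
    return entries
        .map((entry) => ({ ...entry, score: cosineSimilarity(queryVector, entry.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, k);
}

// Toy data: made-up vectors standing in for real embeddings
const entries = [
    { text: "invoice totals", vector: [0.9, 0.1, 0.0] },
    { text: "team biographies", vector: [0.0, 0.2, 0.9] },
    { text: "payment terms", vector: [0.8, 0.3, 0.1] },
];

const top = linearSearch([1, 0, 0], entries, 2);
console.log(top.map((entry) => entry.text));
// [ 'invoice totals', 'payment terms' ]
```

Because every entry is compared with the query, the result is exact, but the cost grows linearly with the number of stored chunks. That is fine for a single PDF and is the trade-off that dedicated vector databases address with approximate indexes.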

Step 3: Retrieving the document

The retrieval chain has three main steps, which are executed in sequence:

input: The input value is read, in our case, from a field called standalone_question
retriever: Retrieves the documents most relevant to that input from the vector store
convertDocsToString: Converts the retrieved documents into a single string

import { RunnableSequence } from "@langchain/core/runnables";

export function createDocumentRetrievalChain(retriever) {
    // Wrap each document's content in <doc> tags
    // and concatenate the results into one string.
    const convertDocsToString = (documents) => {
        return documents.map((document) => `<doc>\n${document.pageContent}\n</doc>`).join("\n");
    };

    // Each runnable is executed in sequence; the output of one
    // becomes the input of the next
    const documentRetrievalChain = RunnableSequence.from([
        (input) => input.standalone_question,
        retriever,
        convertDocsToString,
    ]);

    return documentRetrievalChain;
}

// Create a retriever backed by the vector store
const retriever = vectorstore.asRetriever();

// Build the retrieval chain around that retriever
const documentRetrievalChain = createDocumentRetrievalChain(retriever);
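
To see how the three steps fit together without calling any API, here is a sketch in which a plain function pipeline stands in for RunnableSequence and a stub stands in for the real retriever. Both runInSequence and fakeRetriever are hypothetical, and unlike real runnables they are synchronous:

```javascript
// Hypothetical stand-in for RunnableSequence.from: feeds the
// output of each step into the next one.
function runInSequence(steps) {
    return (input) => steps.reduce((value, step) => step(value), input);
}

// Hypothetical retriever stub returning fixed documents
const fakeRetriever = (query) => [
    { pageContent: `First passage about ${query}` },
    { pageContent: `Second passage about ${query}` },
];

// Same formatting step as in the chain above
const convertDocsToString = (documents) =>
    documents.map((document) => `<doc>\n${document.pageContent}\n</doc>`).join("\n");

const chain = runInSequence([
    (input) => input.standalone_question, // step 1: pick the question
    fakeRetriever,                        // step 2: fetch documents
    convertDocsToString,                  // step 3: format as one string
]);

const context = chain({ standalone_question: "pricing" });
console.log(context);
```

The printed string contains two <doc> blocks, one per retrieved passage, which is the shape of the context that would later be placed into the LLM prompt.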