Alcione Paiva

Using LangChain to Search Your Own PDF Documents

Artificial Intelligence applications like OpenAI's ChatGPT or Google's Gemini enable users to explore a wide range of topics and ask questions with ease. However, there are situations where the information we seek is not readily accessible to these tools but resides in private or less accessible documents. Even in such cases, these applications can leverage their advanced language processing capabilities to analyze these documents, extract relevant information, and provide targeted answers—eliminating the need to manually read through the entire content.

Using a language model to search for information outside its training data is one of the applications of a technique called RAG (Retrieval-Augmented Generation). In this post, we will show how easy it is to create an application that searches through local documents. In our example, we will use a PDF document, but the example can be adapted to other types of documents, such as TXT, MD, JSON, etc. To help us build the example, we will use the LangChain library.

LangChain is a powerful open-source framework that simplifies the construction of natural language processing (NLP) pipelines using large language models (LLMs). LangChain stands out for its ability to build complex process chains, combining different stages of text manipulation and data processing in a modular and scalable manner.

As a development environment, we will use Google Colab Notebook. The notebook can be viewed at this link.

Step 1 - Download the PDF Document

To begin, we need to download the PDF document that we want to process and analyze with LangChain. In our example, we will use a document from the Global Financial Stability Report published by the International Monetary Fund. In the Colab Notebook, the document can be downloaded with the following command:

!wget https://www.imf.org/-/media/Files/Publications/GFSR/2024/April/English/text.ashx -O text.pdf
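Outside of Colab, where the `!wget` shell escape may not be available, the same download can be done from Python. The following is a minimal sketch using only the standard library. The helper names `download_file` and `looks_like_pdf` are hypothetical, and the check on the first bytes is an extra sanity test that is not part of the original notebook; it only helps to notice when a server returns an HTML page instead of the PDF.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the wget command above.
PDF_URL = "https://www.imf.org/-/media/Files/Publications/GFSR/2024/April/English/text.ashx"


def download_file(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the local file `dest`."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Usage (requires network access):
# path = download_file(PDF_URL, "text.pdf")
# print(looks_like_pdf(str(path)))
```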

Step 2 - Install the Libraries

Next, we need to install the necessary libraries using pip. In the Google Colab Notebook, you can install these libraries by running the following commands:

!pip install langchain
!pip install -U langchain-community
!pip install -U langchain-openai
!pip install chromadb
!pip install pypdf2

Here is the explanation of each library:

LangChain:
LangChain is the main library for building natural language processing (NLP) pipelines using large language models (LLMs). This library facilitates the integration of different stages of text manipulation and data processing, enabling the creation of advanced NLP applications.

LangChain-Community:
LangChain-Community is a companion package to LangChain that collects integrations with third-party tools, many of them contributed by the community. In this post, it provides the Chroma vector store class used in Step 6.

LangChain-OpenAI:
LangChain-OpenAI is a specific module for integration with OpenAI's language models, such as GPT-3 and GPT-4. This package allows developers to efficiently use the OpenAI API within the LangChain ecosystem, facilitating the construction of pipelines involving OpenAI's powerful language models.

ChromaDB:
ChromaDB is a database library designed for the efficient storage and management of data as vectors. It is important because textual elements are represented in the form of numeric vectors (embeddings) for analysis by the language model. ChromaDB facilitates the retrieval and manipulation of these vectors for tasks such as search and information retrieval.

PyPDF2:
PyPDF2 is a Python library that enables reading, manipulating, and extracting text from PDF files. This library is essential when working with PDF documents in NLP applications, allowing you to load and process the content of PDFs programmatically.
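Before moving on, it can be useful to confirm that the installation worked. The sketch below is not part of the original notebook; it uses the standard library to check whether each package can be found under its import name (which differs from the pip name in some cases, for example `langchain_openai` for `langchain-openai`). The function name `check_modules` is hypothetical.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
REQUIRED_MODULES = [
    "langchain",
    "langchain_community",
    "langchain_openai",
    "chromadb",
    "PyPDF2",
]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(REQUIRED_MODULES).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```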

Step 3 - Import the Modules to Be Used

from pprint import pprint
import PyPDF2
import os
from google.colab import userdata

import openai
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

The pprint module (short for pretty-print) formats complex data structures so they are easier for humans to read. The os module will be used to store the OpenAI API key in an environment variable. userdata is Google Colab's interface to the secrets stored in the notebook environment; we will use it to obtain the OpenAI API key, which should be registered beforehand in the Colab secrets panel. The other modules will be explained when they are used.
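Since `google.colab` can only be imported inside Colab, the notebook will not run unchanged elsewhere. One possible way around this, sketched below under the assumption that the key is otherwise available as an environment variable, is to try Colab secrets first and fall back to `os.environ`. The helper name `resolve_api_key` is hypothetical and does not appear in the original notebook.

```python
import os


def resolve_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
    except ImportError:
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to Colab secrets or export it.")
    return key


# Usage:
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
```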

Step 4 - Reading the Document and Converting it to Text

In the code snippet below, a PDF file is read, the text contained in each of its pages is extracted, and a portion of this text is displayed.

# Open the PDF file; the with-block closes it automatically
file_path = "./text.pdf"
with open(file_path, "rb") as pdf_file:
    # Create an object to read the PDF
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Extract text from each page
    pdf_text = ""
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()

# Show an excerpt of the extracted text
pdf_text[:2000]

The notebook then displays the first 2000 characters of the text extracted from the PDF (pdf_text[:2000]).

[Image: excerpt of the extracted text]
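The loop above appends the text of each page directly to the previous one, so the last word of a page can end up glued to the first word of the next, and pages without extractable text (for example, scanned images) contribute nothing useful. A possible variation, shown here as a sketch, joins the pages with a separator and skips empty ones. The helper `collect_page_text` and the `FakePage` class are hypothetical; `FakePage` only imitates the `extract_text()` method so the example runs without a PDF. With the real reader, one would pass `pdf_reader.pages` instead.

```python
def collect_page_text(pages, separator="\n"):
    """Join the text of each page, skipping pages that yield no text."""
    parts = []
    for page in pages:
        text = page.extract_text() or ""  # treat None like an empty page
        if text.strip():
            parts.append(text)
    return separator.join(parts)


class FakePage:
    """Stand-in for a PDF page object, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo = collect_page_text([FakePage("Page one."), FakePage(None), FakePage("Page two.")])
print(demo)
```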

Step 5 - Splitting the Text

The next step is to split the text before vectorizing it, that is, before converting each piece of text into a numeric vector. Splitting is important because language models, especially those based on Transformers like BERT, GPT, etc., have a limit on the number of tokens (words or word fragments) they can process at once. Long texts that exceed this limit need to be divided into smaller parts to be processed correctly. Additionally, dividing the text into smaller parts allows each segment to maintain coherent context. If a text is too long and not split, the model might lose context or ignore important parts of the text. By splitting it into segments, we aim for each part to be meaningful and comprehensible on its own.
In our example, we will use LangChain's RecursiveCharacterTextSplitter. It is designed to split the text into smaller, coherent, and meaningful pieces.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(pdf_text)

In the example, we use a chunk_size of 1000, defining the maximum size of each segment in characters, and a chunk_overlap of 100, defining the number of characters that overlap between consecutive segments. The overlap helps to maintain context between the segments.
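To make the roles of chunk_size and chunk_overlap concrete, here is a simplified fixed-window splitter in plain Python. It is not the algorithm used by RecursiveCharacterTextSplitter, which prefers to cut at separators such as paragraph breaks, line breaks, and spaces, so its chunks are usually shorter than chunk_size and the overlap is approximate. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the overlap at work.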

Step 6 - Text Vectorization

The code snippet below sets up an environment to use the OpenAI API and creates a vector database using ChromaDB to store text embeddings.

os.environ['OPENAI_API_KEY'] = userdata.get("OPENAI_API_KEY")
persist_directory = 'db'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_texts(texts=texts,  embedding=embedding,  persist_directory=persist_directory)

os.environ: A dictionary in Python that contains the system's environment variables.
userdata.get("OPENAI_API_KEY"): Retrieves the OpenAI API key from the user data registered in Google Colab's secrets.
persist_directory: Sets the path of the directory where persistent data will be stored. In this case, it is set as 'db'.
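OpenAIEmbeddings(): Creates a client object for OpenAI's embedding models. It is this object that converts texts into vectors, both when the database is built and when a query is run.
Chroma.from_texts(...): A class method of LangChain's Chroma wrapper that embeds a list of texts and stores the resulting vectors in a ChromaDB database.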
texts=texts: Passes the list of texts that will be converted into vectors and stored in the database.
embedding=embedding: Specifies the OpenAI embeddings object to be used for converting the texts into vector representations.
persist_directory=persist_directory: Sets the directory where the database will be saved and persisted.
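To give an intuition for what the vector database does at query time, the sketch below replaces both pieces with toy stand-ins: word counts instead of OpenAI embeddings, and a sorted list instead of ChromaDB. The chunk sentences are invented for the demonstration, and all function names are hypothetical. The principle is the same, though: texts and query are mapped to vectors, and the chunks whose vectors are closest to the query vector are returned.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)
    return ranked[:k]


# Invented example chunks, not taken from the report.
chunks = [
    "Cyber incidents pose a growing risk to banks.",
    "Inflation has eased in several economies.",
    "Private credit markets expanded rapidly.",
]
best = top_k("risk from cyber incidents", chunks, k=1)
print(best)
# → ['Cyber incidents pose a growing risk to banks.']
```

Real embeddings capture meaning rather than exact word overlap, so they can match a question to a passage even when the two share few words.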

Step 7 - Create an Object for Querying

Now we will create an object to query the text. In the code snippet below, an instance of RetrievalQA is created using a specific chain type. RetrievalQA is a class used to answer questions based on an index of documents. It is used to set up a question-and-answer system that combines information retrieval capabilities with a large language model.

qa = RetrievalQA.from_chain_type(llm=OpenAI(),
    chain_type="stuff", retriever=vectordb.as_retriever())

The from_chain_type(...) method of this class creates an instance based on the specified chain type.

Arguments:
llm=OpenAI(): Creates an instance of the OpenAI language model. This instance will be used to generate responses based on the retrieved text.
OpenAI(): LangChain's wrapper class for OpenAI's completion models. It reads the API key from the OPENAI_API_KEY environment variable configured in the previous step.
chain_type="stuff": The chain type is set to "stuff." This indicates how the retrieved documents are combined before being sent to the model. In the case of "stuff," the documents are simply concatenated into a single prompt together with the question.
retriever=vectordb.as_retriever(): vectordb is a vector database being used to retrieve relevant documents. The as_retriever() method transforms this database into an object that can be used to search for documents.
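The idea behind the "stuff" strategy can be illustrated in a few lines of plain Python. This is only a sketch: the function name `build_stuff_prompt` is hypothetical, the retrieved texts are invented, and the wording of the template is illustrative rather than the prompt LangChain actually sends.

```python
def build_stuff_prompt(question: str, documents: list[str], separator: str = "\n\n") -> str:
    """Concatenate all retrieved documents into one context block and append the question."""
    context = separator.join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Invented stand-ins for the chunks a retriever would return.
retrieved = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
prompt = build_stuff_prompt("What do the chunks say?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks plus the question must fit within the model's token limit, which is one more reason to keep chunk_size moderate.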

Step 8 - Conducting the Search

In this final step, we perform the query. In this case, we will ask the model to "Analyze cyber incidents in the current context." Remember to ask politely.

query = "Please, analyze cyber incidents in the current context."
response = qa.invoke(query)
pprint(response['result'])

response = qa.invoke(query): This line uses the qa object (created in the previous step) to search for the answer to the question. The invoke() method takes the question as a parameter and returns a dictionary, which is stored in the variable response.

pprint(response['result']): This line prints the answer stored in the result key of the response dictionary. The pprint() function formats the output to make it easier to read; here, it wraps the long answer string across several lines.
Below is the output produced by the query:

(' Cyber incidents, including cyber attacks, have increased almost doubled '
 'since before the COVID-19 pandemic. However, the total number of incidents '
 'and losses may still be underestimated due to factors such as lag in '
 'reporting and concerns about reputation. Improved reporting and data '
 'collection are needed, and supervisors should require firms to have response '
 'and recovery procedures in place. Ongoing digital transformation and '
 'technological innovation, as well as geopolitical tensions, exacerbate the '
 'risk of cyber incidents. Recent significant incidents, such as a ransomware '
 'attack on a major Chinese bank, highlight the potential impact of cyber '
 'incidents on financial stability. ')
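The parentheses and quoted pieces in the output above come from pprint, not from the model: when a string is wider than the line width, pprint breaks it into adjacent string literals. The short sketch below shows this with a made-up long string (a hypothetical stand-in for a model answer) and confirms that the pieces evaluate back to the original single string.

```python
import ast
from pprint import pformat

# Hypothetical stand-in for a long model answer.
long_answer = "This sentence is repeated to exceed the default line width. " * 3

# pformat() returns the same text that pprint() would print.
formatted = pformat(long_answer)
print(formatted)

# The wrapped pieces are adjacent string literals, so together they
# evaluate back to the original string.
restored = ast.literal_eval(formatted)
print(restored == long_answer)
```

To display the answer as plain text instead, print(response['result']) can be used in place of pprint.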

We have reached the end of this post. If it was useful to you, please consider leaving a comment.
