Vishnu Sivan

Posted on Jun 18, 2023

Revolutionizing NLP Summarization with LangChain: Overcoming Challenges of Large Document Processing and Information Fusion

#langchain #nlp #python #summarization

In natural language processing (NLP), summarizing long or multiple documents is a hard problem. The volume of data slows processing and strains memory, often pushing teams toward high-performance computing infrastructure, and the input frequently exceeds the maximum number of tokens a model accepts in one request. LangChain addresses this by splitting large documents into smaller chunks and processing them either in parallel or sequentially, depending on the chosen chain type, so that no single request has to exceed the model's token limit.

Beyond handling large volumes of text, combining insights from multiple documents into one cohesive summary is a second obstacle. Differences in terminology, conflicting data, and varying coverage of the topic make it hard to merge content. LangChain helps here as well: by carrying relevant information from earlier documents forward into the processing of the current one, it builds a chain of connected documents. This lets summarization models put information in context, judge its importance, and keep sentences in a sensible order in the final summary.

In this article, we will look at how LangChain handles NLP summarization. As part of the tutorial, we will build a summarization app that uses LangChain to summarize the contents of a PDF.

Getting Started

What is LangChain
Components of LangChain
Text summarization using LangChain
How to create a PDF summarizer app using LangChain

What is LangChain

LangChain is a framework developed by Harrison Chase for building applications on top of large language models (LLMs). Since its initial release as an open-source project in October 2022, it has attracted significant attention: at the time of writing, it has 41,900 stars on GitHub and a community of over 800 contributors.

LangChain acts as a bridge between LLM providers, such as OpenAI and HuggingFace Hub, and external data sources like Google, Wikipedia, Notion, and Wolfram. On top of these connections it offers a set of abstractions and tools: chains and agents, which provide structured workflows, as well as prompt templates, memory management, document loaders, and output parsers, which handle the interaction between text input and output.

Components of LangChain

The seven key modules of LangChain are listed below:

Models: Integration of closed or open-source LLMs
Prompts: Template-based user input and output formatting for LLMs
Indexes: Structuring and preparation of data for optimal interaction with LLMs
Memory: Enabling chains or agents to retain short-term and long-term interactions with users
Chains: Combining multiple components or chains into a single pipeline for streamlined processing
Agents: Decision-making entities that utilize available tools and data based on input
Callbacks: Triggered functions to perform specific actions during LLM execution
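
To make the relationship between prompts, models, and chains more concrete, here is a toy sketch in plain Python. It deliberately does not use the LangChain API: make_prompt, fake_llm, and run_chain are hypothetical names invented for this illustration, and the "model" is a stub that only upper-cases the last line of the prompt.

```python
from typing import Callable


def make_prompt(template: str, **values: str) -> str:
    """Prompts: fill a template with user-supplied values."""
    return template.format(**values)


def fake_llm(prompt: str) -> str:
    """Models: hypothetical stand-in for a real LLM call.

    It returns the last line of the prompt in upper case.
    """
    return prompt.strip().splitlines()[-1].upper()


def run_chain(template: str, llm: Callable[[str], str], **values: str) -> str:
    """Chains: combine prompt formatting and the model call in one pipeline."""
    return llm(make_prompt(template, **values))


result = run_chain(
    "Summarize the text below.\n{text}",
    fake_llm,
    text="LangChain links LLMs to data.",
)
print(result)
```

Swapping fake_llm for a function that calls a real model would leave run_chain unchanged, which is the kind of decoupling the module list above describes.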

Text summarization using LangChain

Let’s use a LangChain summarization chain to summarize some text content.

Setting up the Environment

Create and activate a virtual environment by executing the following commands.

python -m venv venv
source venv/bin/activate  # for Linux/macOS
venv\Scripts\activate  # for Windows

Install the openai, langchain, and tiktoken libraries using pip.

pip install openai langchain tiktoken

Creating OpenAI key

An OpenAI API key is required to call OpenAI models through LangChain. Follow these steps to create a new key.
Open platform.openai.com.
Click your name or profile icon in the top-right corner of the page and select “API Keys”, or open the Account API Keys page directly.
Click the “Create new secret key” button to create a new OpenAI key.
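
The listings in this article assign the key directly in the script for brevity. A common alternative is to export the key in your shell and read it at startup, so it never ends up in source control. The following is a minimal sketch of that idea; get_openai_key is a hypothetical helper name, not part of LangChain or the openai package.

```python
import os


def get_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the app."
        )
    return key
```

Calling get_openai_key() once at the top of main.py surfaces a missing key immediately, rather than on the first request to the model.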

Building the app

Create a file main.py and add the following code to it.

from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.mapreduce import MapReduceChain
from langchain.docstore.document import Document
import textwrap

import os
import openai
os.environ["OPENAI_API_KEY"] = "your-openai-key"

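# LLM used for every summarization call
llm = OpenAI(model_name="text-davinci-003")

# Read the input file and split it into chunks
text_splitter = CharacterTextSplitter()
with open("data.txt") as f:
    data = f.read()
texts = text_splitter.split_text(data)

# Wrap the first three chunks as Document objects
docs = [Document(page_content=t) for t in texts[:3]]

# Summarize each chunk, then combine the partial summaries
chain = load_summarize_chain(llm, chain_type="map_reduce")
output_summary = chain.run(docs)

# Wrap the summary at 120 characters per line for printing
wrapped_text = textwrap.fill(output_summary, width=120)
print(wrapped_text)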

Understanding the code:

Import the necessary classes from langchain: OpenAI for the LLM, load_summarize_chain for summarization, CharacterTextSplitter for text splitting, and Document for wrapping each chunk. MapReduceChain is also imported, although load_summarize_chain builds the map_reduce chain itself.
Specify your OPENAI_API_KEY in the environment variables.
Initialize the LLM as OpenAI(model_name="text-davinci-003").
Load the text file, split the text using split_text(), and wrap the first three chunks as Document objects.
Summarize the documents using the chain returned by load_summarize_chain().
Format the result using the textwrap library.
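
To give a feel for what the splitting step does, here is a simplified sketch of chunking with overlap. It is not how CharacterTextSplitter is implemented: LangChain's splitter works on a separator and merges the pieces, whereas this hypothetical split_with_overlap function simply slices by character offset. The chunk_size and overlap values are arbitrary defaults chosen for the example.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Slice text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a boundary still appears in full in one of the two neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk produced this way could then be wrapped in a Document, as the script does with the output of split_text().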

Running the code

Note that I have created a text document with some information related to TCS from https://www.tata.com/business/tcs. You can use your own content for summarization. Run the main.py script using the following command.

python main.py

The script prints the generated summary to the terminal, wrapped at 120 characters per line.

Summarization chains

LangChain provides several chain types for summarizing content. The most widely used are map_reduce, stuff, and refine.

map_reduce

The map_reduce chain first runs an initial prompt on each data chunk independently, producing a summary or answer based only on that section of the document. A second, distinct prompt then combines those initial outputs into a single summary or answer that covers the entire document. Because each chunk is processed on its own, this approach can handle documents far larger than the model's token limit, at the cost of more LLM calls.
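
The control flow can be sketched in a few lines of plain Python. This is an illustration of the idea rather than LangChain's implementation: map_reduce_summarize, first_sentence, and join_lines are hypothetical names, and the two stub functions stand in for the LLM calls made with the map prompt and the combine prompt.

```python
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    map_fn: Callable[[str], str],
    combine_fn: Callable[[str], str],
) -> str:
    """Summarize each chunk on its own, then combine the partial summaries."""
    partial_summaries = [map_fn(chunk) for chunk in chunks]  # map step
    return combine_fn("\n\n".join(partial_summaries))        # reduce step


def first_sentence(text: str) -> str:
    """Stub for the map-step LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def join_lines(text: str) -> str:
    """Stub for the combine-step LLM call: collapse whitespace into one line."""
    return " ".join(text.split())


chunks = ["Alpha is first. More detail.", "Beta is second. Extra detail."]
print(map_reduce_summarize(chunks, first_sentence, join_lines))
```

Since the map step treats every chunk independently, those calls could also run in parallel.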

stuff

The stuff method places all the relevant data into a single prompt as context and passes it to the language model in one call. It is a straightforward approach that works well for small inputs. However, as the amount of data grows, the prompt eventually exceeds the model's token limit and the method becomes impractical.
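
A minimal sketch of the stuffing idea follows, assuming a character budget as a rough stand-in for the model's token limit. The function name build_stuff_prompt, the instruction wording, and the max_chars default are all hypothetical choices for this example, not values taken from LangChain.

```python
from typing import List


def build_stuff_prompt(chunks: List[str], max_chars: int = 8000) -> str:
    """Put every chunk into one prompt, refusing inputs that are too large."""
    context = "\n\n".join(chunks)
    prompt = f"Write a concise summary of the following:\n\n{context}\n\nSUMMARY:"
    if len(prompt) > max_chars:
        # A real application would count tokens here, not characters.
        raise ValueError(
            f"prompt is {len(prompt)} characters; limit is {max_chars}. "
            "Consider map_reduce or refine instead."
        )
    return prompt


print(build_stuff_prompt(["First document.", "Second document."]))
```

The size check is the whole trade-off in miniature: stuffing needs one call, but only while everything fits.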

refine

The refine method begins by running an initial prompt on the first data chunk to generate an output. This output is then passed to the language model together with the next document, with an instruction to refine the output using the new information. The process repeats sequentially for each remaining document, so the output is improved gradually as more documents are processed.
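
The iterative loop can be sketched as follows. As with the other sketches, this is plain Python rather than LangChain code: refine_summarize, first_sentence, and append_first_sentence are hypothetical names, and the stubs stand in for the LLM calls made with the initial prompt and the refine prompt.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial_fn: Callable[[str], str],
    refine_fn: Callable[[str, str], str],
) -> str:
    """Summarize the first chunk, then refine the summary with each later chunk."""
    if not chunks:
        return ""
    summary = initial_fn(chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the running summary plus one new chunk.
        summary = refine_fn(summary, chunk)
    return summary


def first_sentence(text: str) -> str:
    """Stub for the initial LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def append_first_sentence(summary: str, chunk: str) -> str:
    """Stub for the refine LLM call: extend the summary with the new chunk's first sentence."""
    return summary + " " + first_sentence(chunk)


chunks = ["One. Details.", "Two. Details.", "Three. Details."]
print(refine_summarize(chunks, first_sentence, append_first_sentence))
```

Unlike the map step of map_reduce, these calls depend on each other and so have to run one after another.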

How to create a PDF summarizer app using LangChain

In the last section, we created a basic text summarization app using LangChain summarization chains. In this section, we will create an app for summarizing PDF files.

Installing dependencies

Install the langchain, openai, gradio, pypdf, and tiktoken libraries using pip.

pip install langchain openai gradio pypdf tiktoken

Importing required libraries

Create a script named main.py using the code snippet below. Import the necessary libraries and initialize the LLM to summarize the document.

import gradio as gr
from langchain import OpenAI, PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

import os
import openai
os.environ["OPENAI_API_KEY"] = "your-openai-key"

llm = OpenAI(temperature=0)

Defining the summarization function

The summarize_pdf function accepts an uploaded PDF file and uses PyPDFLoader to load its content and split it into smaller sections. It then calls load_summarize_chain to create a map_reduce summarization chain and runs that chain on the loaded documents to produce a concise summary. The listing also defines a PromptTemplate for the map and combine steps. If anything fails, the function returns an error message instead of a summary.

def summarize_pdf(path):
    summary = ""
    try:
        # Load the PDF and split it into smaller documents
        loader = PyPDFLoader(path.name)
        docs = loader.load_and_split()
        # Prompt used for each chunk (map) and for combining the results
        prompt_template = """

        {text}

        SUMMARY:"""
        PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
        # Build the chain with the custom prompts before running it
        chain = load_summarize_chain(llm, chain_type="map_reduce",
                                    map_prompt=PROMPT, combine_prompt=PROMPT)
        summary = chain.run(docs)
    except Exception:
        summary = "Something went wrong. \nPlease try with some other document."
    return summary

Setting up the user Interface using Gradio

In the main function, the Gradio Interface is established by configuring the input and output elements.

def upload_file(file):
    return file.name

def main():
    with gr.Blocks() as demo:
        file_output = gr.File()
        upload_button = gr.UploadButton("Click to Upload a File", file_types=["pdf"])
        upload_button.upload(upload_file, upload_button, file_output)

    output_summary = gr.Textbox(label="Summary")

    interface = gr.Interface(
        fn=summarize_pdf,
        inputs=[upload_button],
        outputs=[output_summary],
        title="PDF Summarizer",
        description="",
    )

    interface.launch()

if __name__ == "__main__":
    main()

Final code

The final code of the app is given below:

import gradio as gr
from langchain import OpenAI, PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader

import os
import openai
os.environ["OPENAI_API_KEY"] = "your-openai-key"

llm = OpenAI(temperature=0)

def summarize_pdf(path):
    summary = ""
    try:
        # Load the PDF and split it into smaller documents
        loader = PyPDFLoader(path.name)
        docs = loader.load_and_split()
        # Prompt used for each chunk (map) and for combining the results
        prompt_template = """

        {text}

        SUMMARY:"""
        PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
        # Build the chain with the custom prompts before running it
        chain = load_summarize_chain(llm, chain_type="map_reduce",
                                    map_prompt=PROMPT, combine_prompt=PROMPT)
        summary = chain.run(docs)
    except Exception:
        summary = "Something went wrong. \nPlease try with some other document."
    return summary

def upload_file(file):
    return file.name

def main():
    with gr.Blocks() as demo:
        file_output = gr.File()
        upload_button = gr.UploadButton("Click to Upload a File", file_types=["pdf"])
        upload_button.upload(upload_file, upload_button, file_output)

    output_summary = gr.Textbox(label="Summary")

    interface = gr.Interface(
        fn=summarize_pdf,
        inputs=[upload_button],
        outputs=[output_summary],
        title="PDF Summarizer",
        description="",
    )

    interface.launch()

if __name__ == "__main__":
    main()

Run the app

Run the app using the following command:

python main.py

Open the URL shown in the terminal to see the output. This example summarizes the research paper Generative AI: Perspectives from Stanford HAI (Generative_AI_HAI_Perspectives.pdf, stanford.edu).

There you have it! Your first LangChain-based summarizer app in Python :)

Thanks for reading this article.

Thanks Gowri M Bhatt for reviewing the content.


The full source code for this tutorial can be found here:

GitHub - codemaker2015/langchain-pdf-summarizer

The article is also available on Medium.
