In this tutorial, we'll walk through the process of building a simple chatbot using Python and Streamlit that can extract text from images embedded in PDF files. This chatbot leverages the power of OpenAI's GPT (Generative Pre-trained Transformer) models to generate text descriptions based on the content of images.
Why LYZR?
Lyzr stands out as the most user-friendly framework for swiftly constructing and deploying Generative AI applications. Embracing an 'agentic' approach, Lyzr simplifies the process compared to Langchain's function and chain methodology and DSPy's more programmatic approach. Unlike its counterparts, Lyzr's SDKs don't demand a profound understanding of underlying complexities. With just a few lines of code, users can swiftly craft their GenAI applications, streamlining the development process significantly.
Introduction
In today's world, extracting information from images is becoming increasingly important. Whether it's analyzing resumes, extracting data from documents, or understanding visual content, automated solutions can save time and effort. In this tutorial, we'll combine image processing techniques with natural language processing (NLP) to create a Chatbot for PDF capable of extracting text from images within PDF files.
Setting Up the Environment
To get started, we'll need a few Python libraries installed: Streamlit for the user interface, PyMuPDF (imported as fitz) for working with PDF files, Pillow for image handling, the OpenAI client for the vision model, and the Lyzr SDK for the chatbot itself.
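Assuming the standard PyPI package names, the dependencies can be installed with pip:
pip install streamlit PyMuPDF openai Pillow lyzr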
import base64
import streamlit as st
import os
import fitz
from openai import OpenAI
from io import BytesIO
from PIL import Image
import tempfile
from lyzr import ChatBot
import shutil
Extracting Images from PDFs
The first step is to extract images from the PDF file uploaded by the user. We'll use PyMuPDF to parse the PDF and extract images from each page.
def extract_images(path, output_dir):
    # Open the PDF, walk every page, and save each embedded image to output_dir
    pdf_path = os.path.join('data', path)
    print(pdf_path)
    pdf_document = fitz.open(pdf_path)
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        image_list = page.get_images(full=True)
        for image_index, img in enumerate(image_list):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = f"{os.path.join(output_dir, os.path.splitext(os.path.basename(pdf_path))[0])}_page{page_number + 1}_image{image_index + 1}.{image_ext}"
            with open(image_filename, "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()
def remove_existing_files(directory):
    # Delete every file, symlink, and sub-directory inside the given directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)
        except Exception as e:
            print(e)
data_directory = "data"
os.makedirs(data_directory, exist_ok=True)
remove_existing_files(data_directory)
data_image_dir = "data/image"
os.makedirs(data_image_dir, exist_ok=True)
remove_existing_files(data_image_dir)
uploaded_file = st.file_uploader("Choose PDF file", type=["pdf","jpeg","png"])
print(uploaded_file)
The code above clears out any files left over in our data and image directories from a previous run. The Streamlit file uploader then lets the user supply a PDF file.
if uploaded_file is not None:
    # Save the uploaded PDF file to the data directory
    file_path = os.path.join(data_directory, uploaded_file.name)
    with open(file_path, "wb") as file:
        file.write(uploaded_file.getvalue())
    # Confirm the file was stored, then pull out its images
    st.success("File successfully saved")
    extract_images(uploaded_file.name, data_image_dir)
Encoding Images
Next, we'll encode the extracted images into base64 format. This encoding lets us embed each image directly into the request we later send to the vision model.
def encode_image(image_path, max_image=512):
    with Image.open(image_path) as img:
        width, height = img.size
        max_dim = max(width, height)
        # Downscale so the longest side is at most max_image pixels
        if max_dim > max_image:
            scale_factor = max_image / max_dim
            new_width = int(width * scale_factor)
            new_height = int(height * scale_factor)
            img = img.resize((new_width, new_height))
        # Re-encode as PNG and return a base64 string
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
        return img_str
# Temporary file that will collect the generated image descriptions
with tempfile.NamedTemporaryFile(mode='w+', delete=False) as temp_file:
    temp_file_path = temp_file.name
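As a quick sanity check, here is a minimal sketch (the image path is illustrative) showing how the encoder's output gets wrapped into a data URL, which is the form the vision model expects later on:
# Minimal sketch; the image path below is illustrative
encoded = encode_image("data/image/sample_page1_image1.png")
data_url = f"data:image/png;base64,{encoded}"
print(data_url[:50] + "...")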
Generating Text from Images
Using OpenAI's GPT vision model, we'll generate a text description for each extracted image. The prompt asks the model to describe the image's content and layout, and each description is written to the temporary file for the chatbot to use.
def generate_text(image_file):
    client = OpenAI()
    max_size = 512  # maximum dimension to allow (512 = 1 tile, 2048 = max)
    encoded_string = encode_image(image_file, max_size)
    system_prompt = ("You are an expert at analyzing images with computer vision. In case of error, "
                     "make a full report of the cause of: any issues in receiving, understanding, or describing images")
    user = "Describe the contents and layout of my image."
    apiresponse = client.chat.completions.with_raw_response.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encoded_string}"},
                    },
                ],
            },
        ],
        max_tokens=500,
    )
    debug_sent = apiresponse.http_request.content  # raw request body, handy for debugging
    chat_completion = apiresponse.parse()
    text = chat_completion.choices[0].message.content
    # Append the description to the temporary file the chatbot will read later
    with open(temp_file_path, "a") as file1:
        file1.writelines(text)
Building the Chat Interface
We'll create a simple user interface using Streamlit, allowing users to upload PDF files and ask questions about the extracted text.
st.title("Lyzr Image and Text PDF Chatbot")
Use Lyzr SDK's txt_chat chatbot
The txt_chat function within Lyzr's ChatBot module provides a robust solution for building chatbots tailored to interact with text documents.
def rag_image_chat(file_path):
    # Chatbot over the text file that holds the generated image descriptions
    vector_store_params = {
        "vector_store_type": "WeaviateVectorStore",
        "index_name": "Lyzr_c"  # first letter should be capital
    }
    chatbot = ChatBot.txt_chat(
        input_files=[str(file_path)],
        vector_store_params=vector_store_params,
    )
    return chatbot
Use Lyzr SDK's pdf_chat chatbot
The pdf_chat function within Lyzr's ChatBot module provides the same capability for chatbots that interact directly with PDF documents.
def rag_pdf_chat(file_path):
    # Chatbot over the original PDF itself
    vector_store_params = {
        "vector_store_type": "WeaviateVectorStore",
        "index_name": "Lyzr_c"  # first letter should be capital
    }
    chatbot = ChatBot.pdf_chat(
        input_files=[str(file_path)],
        vector_store_params=vector_store_params
    )
    return chatbot
# Generate a text description for every image extracted from the PDF
for image in os.listdir(data_image_dir):
    generate_text(os.path.join(data_image_dir, image))
if uploaded_file is not None:
    # Also build a chatbot over the original PDF (indexes its text content)
    rag1 = rag_pdf_chat(os.path.join('data', uploaded_file.name))
if uploaded_file is not None:
    question = st.text_input("Ask a question about the resume:")
    if st.button("Get Answer"):
        # Answer questions using the chatbot built over the image descriptions
        rag = rag_image_chat(temp_file_path)
        response = rag.chat(question)
        st.markdown(f"""{response.response}""")
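With everything wired together, the app can be launched locally with Streamlit (assuming the script is saved as app.py):
streamlit run app.py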
Conclusion
In this tutorial, we've demonstrated how to build a Chatbot for PDF capable of extracting text from images within PDF files using the Lyzr SDK. By combining image processing techniques with natural language generation, we've created a powerful tool for analyzing visual content and answering user queries. This chatbot has various applications, from resume analysis to document summarization, making it a versatile solution for text extraction tasks.
For more information, explore the website: Lyzr
Image and Text PDF Chatbot - Github