<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shuveb Hussain</title>
    <description>The latest articles on DEV Community by Shuveb Hussain (@shuveb_hussain).</description>
    <link>https://dev.to/shuveb_hussain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1679330%2F76eecedc-d357-4e1f-8df1-fbd4b2830708.jpeg</url>
      <title>DEV Community: Shuveb Hussain</title>
      <link>https://dev.to/shuveb_hussain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shuveb_hussain"/>
    <language>en</language>
    <item>
      <title>Checkbox Extraction from PDFs - A Tutorial</title>
      <dc:creator>Shuveb Hussain</dc:creator>
      <pubDate>Tue, 16 Jul 2024 15:15:34 +0000</pubDate>
      <link>https://dev.to/shuveb_hussain/checkbox-extraction-from-pdfs-2bm4</link>
      <guid>https://dev.to/shuveb_hussain/checkbox-extraction-from-pdfs-2bm4</guid>
<description>&lt;p&gt;This is a companion article to the YouTube video where you can see a live coding session in which we extract raw text and structured data from a PDF form that has checkboxes and radio buttons.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/dC7EhnEIdDA?start=3"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;One challenge in processing PDF forms is the need to extract and interpret form elements like checkboxes and radio buttons. Without this, automating business processes that involve these forms simply isn’t possible. And in the wild, these forms can be native PDFs, scans of paper documents, or worse, smartphone photos of paper forms. In this article, we’ll look at how to both extract and interpret PDF documents that contain forms, whether in native text or scanned image format.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before you run the project
&lt;/h2&gt;

&lt;p&gt;The source code for the project can be found here on &lt;a href="https://github.com/Zipstack/llmwhisperer-pdf-checkbox-processing" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. To successfully run the extraction script, you’ll need two different API keys: one for LLMWhisperer and one for the OpenAI API. Please be sure to read the GitHub project’s README to fully understand OS and other dependency requirements. You can &lt;a href="https://llmwhisperer.unstract.com/products" rel="noopener noreferrer"&gt;sign up for LLMWhisperer&lt;/a&gt;, get your API key, and process up to 100 pages per day free of charge.&lt;/p&gt;
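
&lt;p&gt;Both clients read their keys from the environment, so a quick preflight check saves a confusing failure mid-run. Here’s a minimal sketch; the variable names &lt;code&gt;LLMWHISPERER_API_KEY&lt;/code&gt; and &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; are the conventional defaults, but check the project’s README for the exact names it expects:&lt;/p&gt;

```python
import os

# Hypothetical preflight helper; the variable names are assumed defaults.
REQUIRED_KEYS = ("LLMWHISPERER_API_KEY", "OPENAI_API_KEY")

def check_required_keys(names=REQUIRED_KEYS):
    """Return the required API-key variables missing from the environment."""
    return [name for name in names if not os.environ.get(name)]

missing = check_required_keys()
if missing:
    print("Set these environment variables before running:", ", ".join(missing))
```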

&lt;h2&gt;
  
  
  Our approach to PDF checkbox extraction
&lt;/h2&gt;

&lt;p&gt;Because these documents can come in a wide variety of formats and structures, we’ll use a Large Language Model to interpret and structure their contents. Before that, however, we’ll need to extract raw text from the PDF (irrespective of whether it’s native text or scanned).&lt;/p&gt;

&lt;p&gt;If you think about it, the system that extracts raw text from the PDF needs to both detect and render PDF form elements like checkboxes and radio buttons in a way that LLMs can understand. In this example, we’ll use LLMWhisperer to extract the PDF’s raw text, including representations of checkboxes and radio buttons. You can use &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;LLMWhisperer&lt;/a&gt; completely free for processing up to 100 pages per day. As for structuring the output from LLMWhisperer, we’ll use GPT-3.5 Turbo, with &lt;a href="https://github.com/langchain-ai/langchain" rel="noopener noreferrer"&gt;Langchain&lt;/a&gt; and &lt;a href="https://github.com/pydantic/pydantic" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt; to make our job easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input and expected output
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at the input document and what we expect to see as output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The PDF document&lt;/strong&gt;&lt;br&gt;
We want to extract structure information from this &lt;a href="https://github.com/Zipstack/llmwhisperer-pdf-checkbox-processing/blob/main/assets/docs/1003-sample.pdf" rel="noopener noreferrer"&gt;1003 form&lt;/a&gt;, but in this exercise, we’ll only be processing the first page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56srvfg4tuny3etsg287.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56srvfg4tuny3etsg287.png" alt="extract-checkboxes-radio-buttons-from-pdfs-with-llmwhisperer" width="800" height="967"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracted raw text via LLMWhisperer&lt;/strong&gt;&lt;br&gt;
Let’s look at the output from LLMWhisperer&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu728k0kysfmv0oyf9enn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu728k0kysfmv0oyf9enn.png" alt="output" width="775" height="1246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two things should jump out:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLMWhisperer preserves the layout of the input document in the output! This makes it easy for LLMs to get a good idea of the column layout and what the different sections mean.&lt;/li&gt;
&lt;li&gt;LLMWhisperer has rendered checkboxes and radio buttons as simple plain text! This allows LLMs to interpret the document as the user has filled it.&lt;/li&gt;
&lt;/ul&gt;
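
&lt;p&gt;That plain-text rendering is what makes interpretation tractable. As a rough illustration (the exact markers below are an assumption; see the extracted-text screenshot above for LLMWhisperer’s actual rendering), even a few lines of Python can pair a checkbox marker with its label:&lt;/p&gt;

```python
import re

# Assumed plain-text rendering: "[X] Label" for checked, "[ ] Label" for unchecked.
CHECKBOX = re.compile(r"\[(X| )\]\s*([^\[\]]+)")

def parse_checkboxes(line):
    """Return {label: bool} pairs found on one line of extracted text."""
    return {label.strip(): mark == "X" for mark, label in CHECKBOX.findall(line)}

print(parse_checkboxes("[X] Individual    [ ] Joint"))
# e.g. {'Individual': True, 'Joint': False}
```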

&lt;p&gt;&lt;strong&gt;What we want the final JSON to look like&lt;/strong&gt;&lt;br&gt;
We want to output structured JSON using Pydantic, something like the following, so that it’s easy to process downstream. With LLMWhisperer giving us layout-preserving output, especially around checkboxes and radio buttons, it should be easy to get exactly what we want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "personal_details": {
        "name": "Amy America",
        "ssn": "500-60-2222",
        "dob": "1954-03-30T00:00:00Z",
        "citizenship": "U.S. Citizen"
    },
    "extra_details": {
        "type_of_credit": "Individual",
        "marital_status": "Married",
        "cell_phone": "(408) 111-2121"
    },
    "current_address": {
        "street": "4321 Cul de Sac ST",
        "city": "Los Angeles",
        "state": "CA",
        "zip_code": "90210",
        "residing_in_addr_since_years": 10,
        "residing_in_addr_since_months": 2,
        "own_house": true,
        "rented_house": true,
        "rent": 2200,
        "mailing_address_different": false
    },
    "employment_details": {
        "business_owner_or_self_employed": true,
        "ownership_of_25_pct_or_more": true
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Code for checkbox extraction
&lt;/h2&gt;

&lt;p&gt;We’re able to achieve our objectives in about 100 lines of code. Let’s look at the different aspects of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting raw text from the PDF&lt;/strong&gt;&lt;br&gt;
Using LLMWhisperer’s Python client, we can extract data from PDF documents as needed. LLMWhisperer is a cloud service and requires an API key, which you can get for free. LLMWhisperer’s free plan allows you to extract up to 100 pages of data per day, which is more than we need for this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def extract_text_from_pdf(file_path, pages_list=None):
    llmw = LLMWhispererClient()
    try:
        result = llmw.whisper(file_path=file_path, pages_to_extract=pages_list)
        extracted_text = result["extracted_text"]
        return extracted_text
    except LLMWhispererClientException as e:
        error_exit(e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just by calling the &lt;code&gt;whisper()&lt;/code&gt; method on the client, we’re able to extract raw text from images, native text PDFs, scanned PDFs, smartphone photos of documents, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining the schema&lt;/strong&gt;&lt;br&gt;
To use Pydantic with Langchain, we define the schema or structure of the data we want to extract from the unstructured source as Pydantic classes. For our structuring use case, this is what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class PersonalDetails(BaseModel):
    name: str = Field(description="Name of the individual")
    ssn: str = Field(description="Social Security Number of the individual")
    dob: datetime = Field(description="Date of birth of the individual")
    citizenship: str = Field(description="Citizenship of the individual")


class ExtraDetails(BaseModel):
    type_of_credit: str = Field(description="Type of credit")
    marital_status: str = Field(description="Marital status")
    cell_phone: str = Field(description="Cell phone number")


class CurrentAddress(BaseModel):
    street: str = Field(description="Street address")
    city: str = Field(description="City")
    state: str = Field(description="State")
    zip_code: str = Field(description="Zip code")
    residing_in_addr_since_years: int = Field(description="Number of years residing in the address")
    residing_in_addr_since_months: int = Field(description="Number of months residing in the address")
    own_house: bool = Field(description="Whether the individual owns the house or not")
    rented_house: bool = Field(description="Whether the individual rents the house or not")
    rent: float = Field(description="Rent amount")
    mailing_address_different: bool = Field(description="Whether the mailing address is different from the current "
                                                        "address or not")


class EmploymentDetails(BaseModel):
    business_owner_or_self_employed: bool = Field(description="Whether the individual is a business owner or "
                                                              "self-employed")
    ownership_of_25_pct_or_more: bool = Field(description="Whether the individual owns 25% or more of a business")


class Form1003(BaseModel):
    personal_details: PersonalDetails = Field(description="Personal details of the individual")
    extra_details: ExtraDetails = Field(description="Extra details of the individual")
    current_address: CurrentAddress = Field(description="Current address of the individual")
    employment_details: EmploymentDetails = Field(description="Employment details of the individual")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Form1003&lt;/strong&gt; is our main class, which refers to other classes that hold information from subsections of the form. The main thing to note here is how each field includes a description in plain English. These descriptions, along with the type of each field, are then used to construct prompts for the LLM to structure the data we need.&lt;/p&gt;
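
&lt;p&gt;To make that mechanism concrete, here’s a stdlib-only sketch of the idea (not Langchain’s actual implementation): walk a schema’s fields and turn each name, type, and description into a line of prompt text. The &lt;code&gt;format_instructions()&lt;/code&gt; helper here is hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass, field, fields

@dataclass
class PersonalDetails:
    name: str = field(metadata={"description": "Name of the individual"})
    ssn: str = field(metadata={"description": "Social Security Number of the individual"})

def format_instructions(schema):
    """Render one prompt line per field: name, type, and plain-English description."""
    lines = [f'- "{f.name}" ({f.type.__name__}): {f.metadata["description"]}'
             for f in fields(schema)]
    return "Return JSON with these fields:\n" + "\n".join(lines)

print(format_instructions(PersonalDetails))
```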

&lt;p&gt;&lt;strong&gt;Constructing the prompt and calling the LLM&lt;/strong&gt;&lt;br&gt;
The following code uses Langchain to define the prompts that structure data from the raw text the way we need it. The &lt;code&gt;process_1003_information()&lt;/code&gt; function has all the logic to compile our final prompt from the preamble, the instructions derived from the Pydantic class definitions, and the extracted raw text. It then calls the LLM and returns the JSON response. That’s pretty much all there is to it: this code should give us the output we need as structured JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_1003_information(extracted_text):
    preamble = ("What you are seeing is a filled out 1003 loan application form. Your job is to extract the "
                "information from it accurately.")
    postamble = "Do not include any explanation in the reply. Only include the extracted information in the reply."
    system_template = "{preamble}"
    system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
    human_template = "{format_instructions}\n\n{extracted_text}\n\n{postamble}"
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

    parser = PydanticOutputParser(pydantic_object=Form1003)
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
    request = chat_prompt.format_prompt(preamble=preamble,
                                        format_instructions=parser.get_format_instructions(),
                                        extracted_text=extracted_text,
                                        postamble=postamble).to_messages()
    chat = ChatOpenAI()
    response = chat(request, temperature=0.0)
    print(f"Response from LLM:\n{response.content}")
    return response.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
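
&lt;p&gt;The reply comes back as a JSON string; the &lt;code&gt;PydanticOutputParser&lt;/code&gt; we constructed could also turn it into a validated &lt;code&gt;Form1003&lt;/code&gt; instance via &lt;code&gt;parser.parse()&lt;/code&gt;. If you only need dictionaries, a small stdlib guard (a hypothetical helper, not part of the project) is enough to fail fast on malformed replies:&lt;/p&gt;

```python
import json

def parse_llm_json(reply, required_keys=("personal_details", "extra_details",
                                         "current_address", "employment_details")):
    """Parse the LLM reply and fail fast if a top-level section is missing."""
    data = json.loads(reply)
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"LLM reply is missing sections: {missing}")
    return data
```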



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When dealing with forms in native or scanned PDFs, it is important to employ a text extractor that can both recognize form elements like checkboxes and radio buttons and render them in a way that LLMs can interpret.&lt;/p&gt;

&lt;p&gt;In this example, we saw how a 1003 form was processed, taking into account sections that are meaningful only when the checkboxes and radio buttons they contain are interpreted correctly. We saw how LLMWhisperer rendered them in plain text, and how, with just a handful of lines of code, we defined a schema and pulled out structured data in the exact format we need using Langchain and Pydantic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to libraries, packages, and code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The code for this guide can be found in this &lt;a href="https://github.com/Zipstack/llmwhisperer-pdf-checkbox-processing" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LLMWhisperer: A general-purpose text extraction service that extracts data from images and PDFs, preparing and optimizing it for consumption by Large Language Models (LLMs).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/llmwhisperer-client/" rel="noopener noreferrer"&gt;LLMwhisperer Python client on PyPI&lt;/a&gt; | &lt;a href="https://pg.llmwhisperer.unstract.com/" rel="noopener noreferrer"&gt;Try LLMWhisperer Playground for free&lt;/a&gt; | &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;Learn more about LLMWhisperer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pydantic/pydantic" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt;: Use Pydantic to declare your data model. This output parser allows users to specify an arbitrary Pydantic Model and query LLMs for outputs that conform to that schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
For the curious: who am I, and why am I writing about PDF text extraction?
&lt;/h2&gt;

&lt;p&gt;I'm Shuveb, one of the co-founders of Unstract. &lt;br&gt;
&lt;a href="https://unstract.com/" rel="noopener noreferrer"&gt;Unstract&lt;/a&gt; is a no-code platform to eliminate manual processes involving unstructured data using the power of LLMs. The entire process discussed above can be set up without writing a single line of code. And that’s only the beginning. The extraction you set up can be deployed in one click as an API or ETL pipeline.&lt;/p&gt;

&lt;p&gt;With API deployments, you can expose an API to which you send a PDF or an image and get back structured data in JSON format. With an ETL deployment, you can simply drop files into Google Drive, an Amazon S3 bucket, or a variety of other sources, and the platform will run extractions and store the extracted data in a database or a warehouse like Snowflake automatically. Unstract is open-source software, available at &lt;a href="https://github.com/Zipstack/unstract" rel="noopener noreferrer"&gt;https://github.com/Zipstack/unstract&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to quickly try it out, sign up for our free trial. More information &lt;a href="https://unstract.com/start-for-free/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: I originally posted this on the &lt;a href="https://unstract.com/blog/checkbox-extraction-from-pdfs-using-llmwhisperer/" rel="noopener noreferrer"&gt;Unstract blog&lt;/a&gt; a couple of weeks ago.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>tutorial</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Table Extraction and Processing from PDFs - A Tutorial</title>
      <dc:creator>Shuveb Hussain</dc:creator>
      <pubDate>Thu, 11 Jul 2024 14:44:20 +0000</pubDate>
      <link>https://dev.to/shuveb_hussain/table-extraction-and-processing-from-pdfs-a-tutorial-1je7</link>
      <guid>https://dev.to/shuveb_hussain/table-extraction-and-processing-from-pdfs-a-tutorial-1je7</guid>
<description>&lt;p&gt;Tables are prevalent in PDFs, and for those of us intent on extracting information from them, they often pack very useful information into a dense format. Processing tables in PDFs, if not a top-10 priority for humanity, is certainly an important one. Once the raw data is extracted, it’s relatively easy to process using Large Language Models, or LLMs. We do this in two phases: first extract the raw data from the tables in the PDF, then pass it to an LLM so we can extract the data we need as structured JSON. Once we have JSON, it becomes super easy to process in almost any language.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMWhisperer for PDF raw text extraction
&lt;/h2&gt;

&lt;p&gt;In this article, we’ll see how to use Unstract’s LLMWhisperer as the text extraction service for PDFs. As you’ll see in some examples we discuss, LLMWhisperer allows us to extract data from PDFs by page, so we can extract exactly what we need. The key difference between various OCR systems and LLMWhisperer is that it outputs data in a manner that is easy for LLMs to process. Also, we don’t have to worry about whether the PDF is a native text PDF, or is made up of scanned images. LLMWhisperer can automatically switch between text and OCR mode as required by the input document.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before you run the project
&lt;/h2&gt;

&lt;p&gt;The source code for the project &lt;a href="https://github.com/Zipstack/llmwhisperer-table-extraction" rel="noopener noreferrer"&gt;can be found here on GitHub&lt;/a&gt;. To successfully run the extraction script, you’ll need two API keys: one for LLMWhisperer and one for the OpenAI API. Please be sure to read the GitHub project’s README to fully understand OS and other dependency requirements. You can &lt;a href="https://llmwhisperer.unstract.com/products" rel="noopener noreferrer"&gt;sign up for LLMWhisperer&lt;/a&gt;, get your API key, and process up to 100 pages per day free of charge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The input documents
&lt;/h2&gt;

&lt;p&gt;Let’s take a look at the documents and the exact pages within the documents from which we need to extract the tables in question. First off, we have a &lt;a href="https://github.com/Zipstack/llmwhisperer-table-extraction/blob/main/assets/docs/Chase%20Freedom.pdf" rel="noopener noreferrer"&gt;credit card statement&lt;/a&gt; from which we’ll need to extract the table of spends, which is on page 2 as you can see in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg2obnx7tgu0d5s0u4d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg2obnx7tgu0d5s0u4d9.png" alt="Extracting table from a pdf: credit card statement" width="800" height="1522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other document we’ll use is one of Apple’s quarterly financial statements, also known as a 10-Q report. From this, we’ll extract Apple’s sales data by different regions in the world. We will extract a table that contains this information from page 14, a screenshot of which is provided below. You can find the &lt;a href="https://github.com/Zipstack/llmwhisperer-table-extraction/blob/main/assets/docs/Apple_10-Q-Q2-2024.pdf" rel="noopener noreferrer"&gt;sample 10-Q report&lt;/a&gt; in the companion Github repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00t3d2rojge92trsfze2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F00t3d2rojge92trsfze2.png" alt="Extracting table from a pdf: Apple financial statement" width="800" height="1026"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;p&gt;Using LLMWhisperer’s Python client, we can extract data from these documents as needed. LLMWhisperer is a cloud service and requires an API key, which you can get for free. LLMWhisperer’s free plan allows you to extract up to 100 pages of data per day, which is more than we need for this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def extract_text_from_pdf(file_path, pages_list=None):
   llmw = LLMWhispererClient()
   try:
       result = llmw.whisper(file_path=file_path, 
                           pages_to_extract=pages_list)
       extracted_text = result["extracted_text"]
       return extracted_text
   except LLMWhispererClientException as e:
       error_exit(e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just by calling the &lt;code&gt;whisper()&lt;/code&gt; method on the client, we’re able to extract raw text from images, native text PDFs, scanned PDFs, smartphone photos of documents, etc.&lt;/p&gt;

&lt;p&gt;Here’s the extracted data from page 2 of our credit card statement, which contains the list of spends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgvtmkdi8dvvttzy5o5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgvtmkdi8dvvttzy5o5m.png" alt="Extracted document" width="788" height="1204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the extracted text from page 14 of Apple’s 10-Q document, which contains the sales by geography table towards the end:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqduwihmughelm9u0qw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqduwihmughelm9u0qw5.png" alt="Extracted text from pdf" width="800" height="828"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Langchain+Pydantic to structure extracted raw tables
&lt;/h2&gt;

&lt;p&gt;Langchain is a popular LLM programming library, and Pydantic is one of the parsers it supports for structured data output. If you process a lot of documents of the same type, you might be better off using an open-source platform like &lt;a href="https://unstract.com/" rel="noopener noreferrer"&gt;Unstract&lt;/a&gt;, which lets you develop generic prompts more visually and interactively. But for one-off tasks like the example we’re working on, where the main goal is to demonstrate extraction and structuring, Langchain should do just fine. For a comparison of approaches to structuring unstructured documents, read this article: &lt;a href="https://unstract.com/blog/comparing-approaches-for-using-llms-for-structured-data-extraction-from-pdfs/" rel="noopener noreferrer"&gt;Comparing approaches for using LLMs for structured data extraction from PDFs.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining the schema
&lt;/h2&gt;

&lt;p&gt;To use Pydantic with Langchain, we define the schema or structure of the data we want to extract from the unstructured source as Pydantic classes. For the credit card statement spend items, this is what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class CreditCardSpend(BaseModel):
    spend_date: datetime = Field(description="Date of purchase")
    merchant_name: str = Field(description="Name of the merchant")
    amount_spent: float = Field(description="Amount spent")

class CreditCardSpendItems(BaseModel):
    spend_items: list[CreditCardSpend] = Field(description="List of spend items from the credit card statement")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the definitions, &lt;code&gt;CreditCardSpendItems&lt;/code&gt; is just a list of &lt;code&gt;CreditCardSpend&lt;/code&gt;, the line-item schema containing the spend date, merchant name, and amount spent.&lt;/p&gt;
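
&lt;p&gt;Pydantic handles the type coercion for us. As a stdlib-only illustration of what that buys (a hypothetical equivalent, not part of the project), this is roughly the by-hand version:&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Spend:
    spend_date: datetime
    merchant_name: str
    amount_spent: float

def load_spend_items(items):
    """Coerce raw JSON dicts into typed Spend records, mirroring Pydantic's validation."""
    return [Spend(spend_date=datetime.fromisoformat(item["spend_date"]),
                  merchant_name=item["merchant_name"],
                  amount_spent=float(item["amount_spent"]))
            for item in items]
```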

&lt;p&gt;&lt;strong&gt;For the 10-Q regional sales details, the schema looks like the following:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class RegionalFinancialStatement(BaseModel):
    quarter_ending: datetime = Field(description="Quarter ending date")
    net_sales: float = Field(description="Net sales")
    operating_income: float = Field(description="Operating income")
    ending_type: str = Field(description="Type of ending. Set to either '6-month' or '3-month'")

class GeographicFinancialStatement(BaseModel):
    americas: list[RegionalFinancialStatement] = Field(description="Financial statement for the Americas region, "
                                                                   "sorted chronologically")
    europe: list[RegionalFinancialStatement] = Field(description="Financial statement for the Europe region, sorted "
                                                                 "chronologically")
    greater_china: list[RegionalFinancialStatement] = Field(description="Financial statement for the Greater China "
                                                                        "region, sorted chronologically")
    japan: list[RegionalFinancialStatement] = Field(description="Financial statement for the Japan region, sorted "
                                                                "chronologically")
    rest_of_asia_pacific: list[RegionalFinancialStatement] = Field(description="Financial statement for the Rest of "
                                                                               "Asia Pacific region, sorted "
                                                                               "chronologically")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Constructing the prompt and calling the LLM
&lt;/h2&gt;

&lt;p&gt;The following code uses Langchain to define the prompts that structure data from the raw text the way we need it. The &lt;code&gt;compile_template_and_get_llm_response()&lt;/code&gt; function has all the logic to compile our final prompt from the preamble, the instructions derived from the Pydantic class definitions, and the extracted raw text. It then calls the LLM and returns the JSON response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def compile_template_and_get_llm_response(preamble, extracted_text, pydantic_object):
    postamble = "Do not include any explanation in the reply. Only include the extracted information in the reply."
    system_template = "{preamble}"
    system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)
    human_template = "{format_instructions}\n\n{extracted_text}\n\n{postamble}"
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)


    parser = PydanticOutputParser(pydantic_object=pydantic_object)
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
    request = chat_prompt.format_prompt(preamble=preamble,
                                        format_instructions=parser.get_format_instructions(),
                                        extracted_text=extracted_text,
                                        postamble=postamble).to_messages()
    chat = ChatOpenAI()
    response = chat(request, temperature=0.0)
    print(f"Response from LLM:\n{response.content}")
    return response.content


def extract_cc_spend_from_text(extracted_text):
    preamble = ("You're seeing the list of spend items from a credit card statement and your job is to accurately "
                "extract the spend date, merchant name and amount spent for each transaction.")
    return compile_template_and_get_llm_response(preamble, extracted_text, CreditCardSpendItems)


def extract_financial_statement_from_text(extracted_text):
    preamble = ("You're seeing a financial statement for a company and your job is to accurately extract, "
                "for each geographic region, the net sales and operating income for each reporting period.")
    return compile_template_and_get_llm_response(preamble, extracted_text, GeographicFinancialStatement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The output: PDF to JSON
&lt;/h2&gt;

&lt;p&gt;Let’s examine the JSON output from the credit card statement table and the region-wise sales table. Here’s the structured JSON from the credit card spend items table first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "spend_items": [
    {
      "spend_date": "2024-01-04",
      "merchant_name": "LARRY HOPKINS HONDA 7074304151 CA",
      "amount_spent": 265.40
    },
    {
      "spend_date": "2024-01-04",
      "merchant_name": "CICEROS PIZZA SAN JOSE CA",
      "amount_spent": 28.18
    },
    {
      "spend_date": "2024-01-05",
      "merchant_name": "USPS PO 0545640143 LOS ALTOS CA",
      "amount_spent": 15.60
    },
    {
      "spend_date": "2024-01-07",
      "merchant_name": "TRINETHRA SUPER MARKET CUPERTINO CA",
      "amount_spent": 7.92
    },
    {
      "spend_date": "2024-01-04",
      "merchant_name": "SPEEDWAY 5447 LOS ALTOS HIL CA",
      "amount_spent": 31.94
    },
    {
      "spend_date": "2024-01-06",
      "merchant_name": "ATT*BILL PAYMENT 800-288-2020 TX",
      "amount_spent": 300.29
    },
    {
      "spend_date": "2024-01-07",
      "merchant_name": "AMZN Mktp US*RT4G124P0 Amzn.com/bill WA",
      "amount_spent": 6.53
    },
    {
      "spend_date": "2024-01-07",
      "merchant_name": "AMZN Mktp US*RT0Y474Q0 Amzn.com/bill WA",
      "amount_spent": 21.81
    },
    {
      "spend_date": "2024-01-05",
      "merchant_name": "HALAL MEATS SAN JOSE CA",
      "amount_spent": 24.33
    },
    [some items removed for concision]
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
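&lt;p&gt;Once the output is in this shape, downstream processing is plain data manipulation. As a small, hypothetical example (a few of the rows above, hard-coded as Python dictionaries), here is how you might total the spend and group it by date:&lt;/p&gt;

```python
from collections import defaultdict

# A few rows from the structured output above, as plain Python data.
spend_items = [
    {"spend_date": "2024-01-04", "merchant_name": "CICEROS PIZZA SAN JOSE CA", "amount_spent": 28.18},
    {"spend_date": "2024-01-05", "merchant_name": "USPS PO 0545640143 LOS ALTOS CA", "amount_spent": 15.60},
    {"spend_date": "2024-01-05", "merchant_name": "HALAL MEATS SAN JOSE CA", "amount_spent": 24.33},
]

# Total spend across the statement.
total_spend = round(sum(item["amount_spent"] for item in spend_items), 2)

# Spend grouped by transaction date.
spend_by_date = defaultdict(float)
for item in spend_items:
    spend_by_date[item["spend_date"]] += item["amount_spent"]
```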



&lt;p&gt;Excellent! We got exactly what we were looking for. Next, let’s look at the JSON output structured from the regional sales table from Apple’s 10-Q PDF.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "americas": [
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 37273,
            "operating_income": 15074,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 37784,
            "operating_income": 13927,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 87703,
            "operating_income": 35431,
            "ending_type": "6-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 87062,
            "operating_income": 31791,
            "ending_type": "6-month"
        }
    ],
    "europe": [
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 24123,
            "operating_income": 9991,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 23945,
            "operating_income": 9368,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 54520,
            "operating_income": 22702,
            "ending_type": "6-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 51626,
            "operating_income": 19385,
            "ending_type": "6-month"
        }
    ],
    "greater_china": [
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 16372,
            "operating_income": 6700,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 17812,
            "operating_income": 7531,
            "ending_type": "3-month"
        },
        {
            "quarter_ending": "2024-03-30T00:00:00Z",
            "net_sales": 37191,
            "operating_income": 15322,
            "ending_type": "6-month"
        },
        {
            "quarter_ending": "2023-04-01T00:00:00Z",
            "net_sales": 41717,
            "operating_income": 17968,
            "ending_type": "6-month"
        }
    ],
    "japan": [
    "Some items removed for concision"
    ],
    "rest_of_asia_pacific": [
        "Some items removed for concision"
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
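&lt;p&gt;Because each region carries both 3-month and 6-month figures for the current and prior year, period-over-period comparisons fall out directly. As an illustrative sketch (the two 3-month Americas rows from the output above, hard-coded; the variable names are ours):&lt;/p&gt;

```python
# The two 3-month Americas rows from the structured output above.
americas_quarterly = [
    {"quarter_ending": "2024-03-30T00:00:00Z", "net_sales": 37273, "operating_income": 15074},
    {"quarter_ending": "2023-04-01T00:00:00Z", "net_sales": 37784, "operating_income": 13927},
]

current, prior = americas_quarterly

# Year-over-year change in quarterly net sales, in percent.
yoy_net_sales_pct = round(
    (current["net_sales"] - prior["net_sales"]) / prior["net_sales"] * 100, 2
)
```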



&lt;p&gt;This output is exactly what we expected, too. With this kind of structured output, it becomes very easy to further process the complex information these tables contain. The key to easy extraction of structured information is cleanly formatted input text, and &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;LLMWhisperer&lt;/a&gt; plays a key role in producing it, even for scanned documents or documents that are just smartphone photos.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links to libraries, packages, and code
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LLMWhisperer: A general-purpose text extraction service that extracts text from images and PDFs, preparing and optimizing it for consumption by Large Language Models (LLMs).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/llmwhisperer-client/" rel="noopener noreferrer"&gt;LLMwhisperer Python client on PyPI&lt;/a&gt; | &lt;a href="https://pg.llmwhisperer.unstract.com/" rel="noopener noreferrer"&gt;Try LLMWhisperer Playground for free&lt;/a&gt; | &lt;a href="https://unstract.com/llmwhisperer/" rel="noopener noreferrer"&gt;Learn more about LLMWhisperer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The code for this guide can be found in this &lt;a href="https://github.com/Zipstack/llmwhisperer-table-extraction" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pydantic/pydantic" rel="noopener noreferrer"&gt;Pydantic&lt;/a&gt;: Use Pydantic to declare your data model. This output parser allows users to specify an arbitrary Pydantic Model and query LLMs for outputs that conform to that schema.&lt;/li&gt;
&lt;/ul&gt;
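&lt;p&gt;To make that last point concrete: a Pydantic model typically drives an LLM through its JSON schema, which is embedded in the prompt so the model knows the exact output shape to produce. A minimal sketch, assuming Pydantic v2 (the model here is illustrative):&lt;/p&gt;

```python
from pydantic import BaseModel


class SpendItem(BaseModel):
    spend_date: str
    merchant_name: str
    amount_spent: float


# model_json_schema() produces the JSON schema that gets embedded in the
# LLM prompt, telling the model the exact output structure to conform to.
schema = SpendItem.model_json_schema()
```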

&lt;h2&gt;
  
  
For the curious: who am I, and why am I writing about PDF text extraction?
&lt;/h2&gt;

&lt;p&gt;I'm Shuveb, one of the co-founders of Unstract. &lt;br&gt;
&lt;a href="https://unstract.com/" rel="noopener noreferrer"&gt;Unstract&lt;/a&gt; is a no-code platform that uses the power of LLMs to eliminate manual processes involving unstructured data. The entire process discussed above can be set up without writing a single line of code. And that’s only the beginning: the extraction you set up can be deployed in one click as an API or an ETL pipeline.&lt;/p&gt;

&lt;p&gt;With an API deployment, you expose an API to which you send a PDF or an image and get back structured data in JSON format. With an ETL deployment, you can simply drop files into Google Drive, an Amazon S3 bucket, or a variety of other sources, and the platform will run extractions and automatically store the extracted data in a database or a warehouse like Snowflake. Unstract is open-source software and is available at &lt;a href="https://github.com/Zipstack/unstract" rel="noopener noreferrer"&gt;https://github.com/Zipstack/unstract&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to try it out quickly, sign up for our free trial. More information is available &lt;a href="https://unstract.com/start-for-free/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: I originally posted this on the &lt;a href="https://unstract.com/blog/extract-table-from-pdf/" rel="noopener noreferrer"&gt;Unstract blog&lt;/a&gt; a couple of weeks ago.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
