<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maximo Guerrero</title>
    <description>The latest articles on DEV Community by Maximo Guerrero (@maximoguerrero).</description>
    <link>https://dev.to/maximoguerrero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F184652%2Fdbc4b28e-5c3a-42af-bae7-a0bad8b1f4d1.jpeg</url>
      <title>DEV Community: Maximo Guerrero</title>
      <link>https://dev.to/maximoguerrero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maximoguerrero"/>
    <language>en</language>
    <item>
      <title>Use GPT4-Vision for PDF to JSON data extraction</title>
      <dc:creator>Maximo Guerrero</dc:creator>
      <pubDate>Thu, 14 Mar 2024 13:30:13 +0000</pubDate>
      <link>https://dev.to/maximoguerrero/use-gpt4-vision-for-pdf-to-json-data-extraction-31lc</link>
      <guid>https://dev.to/maximoguerrero/use-gpt4-vision-for-pdf-to-json-data-extraction-31lc</guid>
      <description>&lt;h1&gt;
  
  
  Converting PDFs to Structured JSON
&lt;/h1&gt;

&lt;p&gt;PDF files are commonly used for storing and sharing documents, but extracting data from them can be a challenging task. The PDF-GPT4-JSON project aims to simplify this process by leveraging the power of GPT-4 Vision, a state-of-the-art language model, to convert PDF files into structured JSON format. In this article, we will explore the theory behind this conversion process and discuss how it can be applied in real-world scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge of OCR Data in PDFs
&lt;/h2&gt;

&lt;p&gt;One of the main challenges in extracting data from PDFs is the accuracy of the OCR (Optical Character Recognition) process. OCR is used to convert scanned or image-based PDFs into searchable and editable text. However, OCR data can often contain inaccuracies and garbage characters, especially in complex layouts or low-quality scans. This can result in errors and inconsistencies in the extracted text.&lt;/p&gt;

&lt;p&gt;To address this challenge, the &lt;a href="https://github.com/maximoguerrero/PDF-GPT4-JSON"&gt;PDF-GPT4-JSON&lt;/a&gt; CLI uses GPT-4 Vision, which has been fine-tuned for image understanding and analysis. By leveraging deep learning techniques, GPT-4 Vision can effectively analyze the layout of the text in PDFs and infer the hierarchical structure of the data. This helps to mitigate the impact of inaccurate OCR data and generate more accurate and structured JSON output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating Structured JSON with GPT-4 Vision
&lt;/h2&gt;

&lt;p&gt;The process of generating structured JSON using GPT-4 Vision involves several steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PDF Parsing&lt;/strong&gt;: The PDF file is parsed to extract the textual content and layout information of each page. This includes identifying the position, size, and formatting of the text elements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Extraction&lt;/strong&gt;: The extracted text is processed to remove noise and irrelevant information, such as headers, footers, and page numbers. This helps to focus on the main content of the PDF.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layout Analysis&lt;/strong&gt;: GPT-4 Vision analyzes the layout of the text on each page to identify the hierarchical structure of the data. It looks for patterns, indentation, and formatting cues to infer the relationships between different elements. For example, it can identify headings, subheadings, lists, and tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JSON Generation&lt;/strong&gt;: Based on the layout analysis, GPT-4 Vision generates a structured JSON representation of the PDF content. Each page is represented as a separate JSON file, with nested objects and arrays to capture the hierarchical relationships. This allows for easy navigation and extraction of specific data elements.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
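
&lt;p&gt;The request side of this pipeline can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code: the prompt text and helper name are assumptions, and actually sending a page requires the &lt;em&gt;openai&lt;/em&gt; package plus page images rendered from the PDF (for example with &lt;em&gt;pdf2image&lt;/em&gt;).&lt;/p&gt;

```python
import base64

# Illustrative prompt; the real project ships its own prompt file.
PROMPT = (
    "Analyze the layout of this scanned page and return its content "
    "as structured JSON, nesting headings, lists, and tables."
)

def build_vision_messages(image_bytes: bytes) -> list:
    """Build a chat-completions message list embedding one page image inline."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + b64}},
        ],
    }]

# With the openai package installed and an API key set, one page could
# then be sent along these lines:
#
#   client = openai.OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4-vision-preview",
#       messages=build_vision_messages(page_png_bytes),
#   )
#   page_json = resp.choices[0].message.content
```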

&lt;h2&gt;
  
  
  Installation and Usage
&lt;/h2&gt;

&lt;p&gt;This article assumes you have Python 3.10 or greater installed.&lt;/p&gt;

&lt;p&gt;To use the PDF-GPT4-JSON CLI, you need to install it via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pdf_gpt4_json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need to set your OpenAI API key by either exporting it as an environment variable or passing it as a command-line argument to the tool.&lt;/p&gt;
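
&lt;p&gt;Exporting the key looks like this. &lt;em&gt;OPENAI_API_KEY&lt;/em&gt; is the usual variable name for OpenAI tooling; check the project README for the exact flag name if you prefer passing the key on the command line. The key value below is a placeholder.&lt;/p&gt;

```shell
export OPENAI_API_KEY="sk-your-key-here"
```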

&lt;p&gt;Once installed, you can run the conversion script by providing the path to the PDF file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pdf-gpt4-json ./sample.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate a temporary working folder and an output folder with JSON files for each page of the PDF. The output folder will be named after the PDF file, with the prefix "samplepdf_final_folders" in this case.&lt;/p&gt;

&lt;p&gt;The project also provides additional parameters that can be adjusted to customize the conversion process. These parameters include the path to a prompt file, the OpenAI API key, the model to use, verbosity level, and whether to clean up temporary files after processing.&lt;/p&gt;

&lt;p&gt;Based on the first page of our &lt;a href="https://github.com/maximoguerrero/PDF-GPT4-JSON/blob/main/samples/sample.pdf"&gt;sample.pdf&lt;/a&gt; [ original document from &lt;a href="https://projects.propublica.org/nonprofits/organizations/800003840/202301359349103895/full"&gt;propublica&lt;/a&gt; ], an IRS 990 tax form (a public document that nonprofits must file; the company's Employer Identification Number (EIN) is public, so it was not redacted), the tool can output the following JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Form": "990-PF",
    "Return of Private Foundation": {
        "Year": "2022",
        "Tax year beginning": "01-01-2022",
        "Tax year ending": "12-31-2022"
    },
    "Name of foundation": "THE RHODODS FOUNDATION",
    "Address": {
        "Number and street": "13-15 W 54th ST",
        "City or town": "NEW YORK",
        "State": "NY",
        "ZIP code": "10019"
    },
    "Employer identification number": "23-102392",
    "Part I - Analysis of Revenue and Expenses": {
        "Contributions, gifts, grants, etc., received": "",
        "Interest on savings and temporary cash investments": "",
        "Dividends and interest from securities": "280,358",
        "Gross rents": "",
        "Net rental income or (loss)": "",
        "Net gain or (loss) from sale of assets not on line 10": "-6,068",
        "Capital gain net income (from Part IV, line 2)": "3,219,668",
        "Net short-term capital gain": "",
        "Income modifications": "",
        "Total (add lines 1 through 9)": "3,494,040",
        "Expenses and Disbursements for Charitable Purposes (attach schedule)": {
            "Compensation of officers, directors, trustees, etc.": "",
            "Other employee salaries and wages": "",
            "Pension plans, employee benefits": "",
            "Legal fees (attach schedule)": "",
            "Accounting fees (attach schedule)": "13,000",
            "Other professional fees (attach schedule)": "6,500",
            "Interest": "",
            "Taxes (attach schedule)": "",
            "Depreciation (attach schedule) and depletion": "",
            "Occupancy": "",
            "Travel, conferences, and meetings": "",
            "Printing and publications": "",
            "Other expenses (attach schedule)": "53,134",
            "Total operating and administrative expenses": "157,584",
            "Contributions, gifts, grants paid": "555,082",
            "Total expenses and disbursements": "712,666",
            "Excess of revenue over expenses and disbursements": "2,781,374",
            "Net investment income": "2,485,978",
            "Adjusted net income": "231,296"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
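
&lt;p&gt;Because each page becomes plain JSON, pulling specific fields back out is ordinary dictionary access. A small sketch using a fragment of the output above:&lt;/p&gt;

```python
import json

# A fragment of the per-page JSON shown above.
page = json.loads("""
{
    "Form": "990-PF",
    "Employer identification number": "23-102392",
    "Part I - Analysis of Revenue and Expenses": {
        "Dividends and interest from securities": "280,358"
    }
}
""")

ein = page["Employer identification number"]

# Amounts come back as formatted strings, so strip the thousands
# separators before doing arithmetic on them.
dividends = float(
    page["Part I - Analysis of Revenue and Expenses"]
        ["Dividends and interest from securities"].replace(",", "")
)
```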



&lt;h2&gt;
  
  
  Applications and Benefits
&lt;/h2&gt;

&lt;p&gt;The PDF-GPT4-JSON project opens up a wide range of possibilities for developers and data analysts. Here are some potential applications and benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Extraction&lt;/strong&gt;: The structured JSON output makes it easy to extract specific data elements from PDFs, such as tables, lists, or headings. This can be useful for data analysis, data mining, or integrating PDF data into other systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt;: By automating the PDF-to-JSON conversion process, developers can save time and effort in manually extracting data from PDFs. This can be particularly beneficial for large volumes of PDF files or recurring data extraction tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration&lt;/strong&gt;: The JSON output can be easily integrated into existing workflows or applications. For example, it can be used as a data source for business intelligence dashboards, machine learning models, or data visualization tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;: The structured JSON format allows for easy manipulation and processing of PDF data. Developers can apply various data processing techniques, such as filtering, aggregation, or transformation, to derive insights or generate new data sets.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In conclusion, the PDF-GPT4-JSON project provides a powerful solution for converting PDF files into structured JSON format. By leveraging the capabilities of GPT-4 Vision, it simplifies the extraction and analysis of data from PDFs, opening up new possibilities for developers and data analysts. Whether it's automating data extraction, integrating PDF data into workflows, or performing advanced data processing, the PDF-GPT4-JSON project offers a versatile tool for working with PDFs.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Simple Queue in PostgreSQL</title>
      <dc:creator>Maximo Guerrero</dc:creator>
      <pubDate>Tue, 31 Dec 2019 22:08:06 +0000</pubDate>
      <link>https://dev.to/maximoguerrero/simple-queue-in-postgresql-2kac</link>
      <guid>https://dev.to/maximoguerrero/simple-queue-in-postgresql-2kac</guid>
      <description>&lt;p&gt;Sometimes you need a simple job queue to enable offline or deferred processing of data. While this is not a pub/sub system, which is what a lot of people use queues for, you could expand upon it.&lt;/p&gt;

&lt;p&gt;I will be using Postgres as the database (you could apply the concepts in this post to other RDBMSs). What this is not is a tutorial in SQL; I assume you know how to write and understand basic SQL (SELECT, INSERT, UPDATE, and DELETE).&lt;/p&gt;

&lt;p&gt;Let's start with the definition of our table. One thing you will notice is that we have a column called &lt;em&gt;jobData&lt;/em&gt; of type json; this is so that you can store the data needed to process a job in a structured way. Feel free to change it to TEXT or anything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE public."jobQueue"
(
    "jobId" serial NOT NULL,
    "jobData" json NOT NULL,
    status character varying,
    added timestamp without time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    started timestamp without time zone,
    ended timestamp without time zone,
    CONSTRAINT "jobQueue_pkey" PRIMARY KEY ("jobId")
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
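
&lt;p&gt;With the table in place, enqueuing a job is just an insert. The json payload shape is entirely up to you; the payload below is illustrative:&lt;/p&gt;

```
INSERT INTO public."jobQueue" ("jobData", status)
VALUES ('{"task": "resizeImage", "imageId": 42}', 'new');
```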



&lt;p&gt;Now the secret to using a database table as a queue, is to lock the row for updates while you're getting a job from the queue.&lt;/p&gt;

&lt;p&gt;Let's create a function that will take as an argument the number of jobs to pull off the queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE OR REPLACE FUNCTION public."jobQueue_getJobs"(
    "_numJobsToGet" integer)
    RETURNS TABLE("jobId" integer, status character varying)
    LANGUAGE 'plpgsql'

AS $BODY$

DECLARE _jobId int;

BEGIN
   FOR _jobId IN
        SELECT jq1."jobId"
          FROM "jobQueue" jq1
         WHERE jq1.status = 'new'
           AND jq1.added &amp;lt; NOW()
         ORDER BY jq1.added
         LIMIT "_numJobsToGet"
         FOR UPDATE SKIP LOCKED
   LOOP
      UPDATE "jobQueue" jq2
         SET status = 'pickedup',
             started = CURRENT_TIMESTAMP
       WHERE jq2."jobId" = _jobId;

      -- assign the OUT columns before emitting the result row
      "jobId" := _jobId;
      status := 'pickedup';
      RETURN NEXT;
   END LOOP;
   RETURN;
END
$BODY$;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So executing this function will return a list of jobIds that you can then use to select from and update the jobQueue table without any other process stepping on your jobs.&lt;/p&gt;
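
&lt;p&gt;For example, a worker that wants up to five jobs would call:&lt;/p&gt;

```
SELECT * FROM public."jobQueue_getJobs"(5);
```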

&lt;p&gt;One nice property of this approach is that if a worker process dies and a job is stuck in limbo, you can reset its status and reprocess it without any data loss.&lt;/p&gt;

&lt;p&gt;Once you're done processing a job, you can mark it as done, or remove it from the queue if you are space constrained.&lt;/p&gt;
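
&lt;p&gt;Marking a finished job, or clearing it out, is a plain update or delete; for example:&lt;/p&gt;

```
UPDATE public."jobQueue"
   SET status = 'done', ended = CURRENT_TIMESTAMP
 WHERE "jobId" = 42;

-- or, if you are space constrained:
DELETE FROM public."jobQueue" WHERE "jobId" = 42;
```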

</description>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>Easy email dashboards with dashflow.io</title>
      <dc:creator>Maximo Guerrero</dc:creator>
      <pubDate>Sat, 29 Jun 2019 15:35:58 +0000</pubDate>
      <link>https://dev.to/maximoguerrero/easy-email-dashboards-with-dashflow-io-489c</link>
      <guid>https://dev.to/maximoguerrero/easy-email-dashboards-with-dashflow-io-489c</guid>
      <description>&lt;p&gt;If you need to create dashboards sent via email driven by a database, check out &lt;a href="https://dashflow.io" rel="noopener noreferrer"&gt;https://dashflow.io&lt;/a&gt;. It can be used both as a command-line utility and as a Python module. &lt;/p&gt;

&lt;p&gt;Even with all the BI tools, I still get requests for dashboards to be emailed or printed. Tools like Tableau don't lend themselves to print-friendly or email-friendly output. Some people just don't have the time to go several clicks in and drill down when they are in meetings all day.&lt;/p&gt;

&lt;p&gt;Dashflow will work out of the box with any Python DB-API 2 database module that can be used with SQLAlchemy. Your database queries are all stored in SQL files; under the hood it uses &lt;a href="https://pugsql.org/" rel="noopener noreferrer"&gt;https://pugsql.org/&lt;/a&gt;, which can best be described as a reverse ORM. It is entirely built on open-source modules.&lt;/p&gt;

&lt;p&gt;The email is constructed via a simple JSON file that has an array of sections. Sections can be anything from tables and charts to text with data mixed in.&lt;/p&gt;

&lt;p&gt;For the moment it only supports sending via SMTP. But if you want to try it out, &lt;a href="https://sendgrid.com" rel="noopener noreferrer"&gt;https://sendgrid.com&lt;/a&gt; has a nice free tier.&lt;/p&gt;

&lt;p&gt;Charts are rendered using &lt;a href="https://quickchart.io/" rel="noopener noreferrer"&gt;https://quickchart.io/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, let's walk through the steps to send an email.&lt;/p&gt;




&lt;h4&gt;
  
  
  Clone and setup dashflow
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/maximoguerrero/dashflow.git
cd dashflow
chmod u+x df-cli.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create folders for the email
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir testemail
cd testemail

mkdir sql

cp ../sample/sample.db sample.db

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create your Sql file
&lt;/h4&gt;

&lt;p&gt;Since we are using PugSQL, the first line of the SQL file is a comment that defines the module and its return type. Here the module's name is &lt;strong&gt;&lt;em&gt;media_profit_by_type_filtered&lt;/em&gt;&lt;/strong&gt;, and that is what you will use in your config file.&lt;/p&gt;

&lt;p&gt;This query simply returns the media type that was most profitable from our sample database. Save this file in the &lt;strong&gt;sql&lt;/strong&gt; folder we created in the previous step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- :name media_profit_by_type_filtered    :many
SELECT  mt.Name as label, strftime('%Y',InvoiceDate) as year  ,  printf("%.2f", sum(ii.UnitPrice ) )  as value
FROM `tracks` t
inner join `media_types` mt on mt.MediaTypeId = t.MediaTypeId
inner join `invoice_items` ii on ii.TrackId = t.TrackId
inner join `invoices` i on i.InvoiceId = ii.InvoiceId
where mt.Name like '%' || :kind || '%'
group by t.MediaTypeId, year
order by year;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Next create a simple json
&lt;/h4&gt;

&lt;p&gt;In this file, we will provide a couple of things: a connection string compatible with SQLAlchemy, a relative path to the folder where the SQL files are, and a quickchart.io-compatible service URL.&lt;/p&gt;

&lt;p&gt;The sections array contains a definition for a chart. Its moduleName is what we defined in the SQL file, and the parameter is the one used in the SQL file.&lt;/p&gt;

&lt;p&gt;The next important part is the SMTP section, where you provide the parameters for your SMTP server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "isDebug": true,
    "connection": "sqlite:///&amp;lt;absolutePath&amp;gt;/sample.db",
    "sqlfolder": "sql/",
    "quickChartsUrl": "https://quickcharts.dashflow.io/chart?bkg=white&amp;amp;c=",
    "title": "This Months Dashboard",
    "description": "Ad reprehenderit amet mollit Lorem aliquip sint anim ipsum nisi deserunt commodo veniam magna.",
    "sections": [
        {
            "type": "chart",
            "sectionTitle": "horizontal bar chart! with query parameter",
            "description": "Irure esse eu officia consequat mollit ullamco est aliquip..",
            "moduleName": "media_profit_by_type_filtered",
            "moduleParameters": [
                {
                    "name": "kind"
                }
            ],
            "groupBy": "year",
            "chart": {
                "type": "horizontalBar"
            }
        }
    ],
    "to": [
        {
            "type": "string",
            "value": "email.to@send.com"
        }

    ],
    "from": "email@from.com",
    "subject": "my first email",
    "sendEngine": {
        "type": "smtp",
        "host": "YOUR_SMTP_HOST",
        "port": "587",
        "enableTLS": true,
        "requiresAuthentication": true,
        "useEnvVariable":{
            "username": "SMTP_USER",
            "password": "SMTP_PWD"

        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
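
&lt;p&gt;Since the config's &lt;em&gt;useEnvVariable&lt;/em&gt; section points at the SMTP_USER and SMTP_PWD environment variables, export your credentials before running the script (the values below are placeholders):&lt;/p&gt;

```shell
export SMTP_USER="your-smtp-username"
export SMTP_PWD="your-smtp-password"
```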



&lt;h4&gt;
  
  
  Next, let's run the script
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./df-cli.py --configFile testemail/sample-config.json --parameter="kind:audio"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get the following email: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashflow.io%2Fdashflow-preview.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdashflow.io%2Fdashflow-preview.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
