
Jian

#003 | Automate PDF data extraction: Discovery

Overview

I first analysed the Custodian Statement PDF to understand its structure and identify datapoints relevant to the Project.

This helps me think through the business logic that needs to be translated into code.

On the coding aspect, I will use ChatGPT to guide me as I have no prior experience writing code to extract PDF data. Taking time to understand ChatGPT's suggestions is as important as getting the right answer, as I also want to improve my coding skills.

I'd also like to see how the experience of sourcing solutions from ChatGPT compares with platforms like Stack Overflow.


Step 1: Analyse PDF structure

The Custodian Statements have three major sections, each containing one or more tables.

In the snapshots below, I annotated the tables (blue), and the table headers and non-table text (green) relevant to the Project requirements.

Section 1: Unit Trust

Section 2: Stock Holdings

Section 3: Cash Holdings

➡️ The business logic below needs to be translated to code

| ID | Business Logic |
| --- | --- |
| BL-1.0 | Extract Fund Name, Units and Current Value from the "Unit trust holdings as at Statement Date" table |
| BL-2.0 | Extract Stocks/ETFs Name, Exchange, Quantity, Price and Current Value from the "Stock Holdings as at Statement Date" table |
| BL-3.0 | Extract Opening Balance and Closing Balance from the "Cash Account Transactions in the month" table |
| BL-4.0 | Extract the account number "M1234567" from the "Account Number M1234567" string |
| BL-5.0 | Extract the date "Jan-2024" from "Statement for Jan-2024" |
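BL-4.0 and BL-5.0 are simple string extractions, so they can be sketched with regular expressions. The sample text below is a hypothetical stand-in for lines pulled out of the PDF:

```python
import re

# Hypothetical sample, standing in for text extracted from the statement PDF
page_text = "Account Number M1234567\nStatement for Jan-2024"

# BL-4.0: the account number follows the literal label "Account Number"
account_match = re.search(r"Account Number\s+(\w+)", page_text)
account_number = account_match.group(1) if account_match else None

# BL-5.0: the statement date follows the literal label "Statement for"
date_match = re.search(r"Statement for\s+([A-Za-z]{3}-\d{4})", page_text)
statement_date = date_match.group(1) if date_match else None

print(account_number)   # M1234567
print(statement_date)   # Jan-2024
```

The patterns anchor on the labels rather than positions, so they should survive minor layout shifts in the PDF text.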

Step 2: Identify a suitable Python library

As you can see from the snapshots, the Custodian Statement contains a mixture of tables and text.

I asked ChatGPT a general question on Python libraries suited to PDF data extraction.

What Python library can I use to extract PDF data?

| Library | Description |
| --- | --- |
| PyPDF2 | PyPDF2 is a pure-Python library that can extract text from a PDF, but it is relatively simple and doesn't handle more complex PDFs (like those with embedded images or non-standard fonts) as well as other libraries. |
| pdfplumber | pdfplumber is an excellent library for extracting both text and tables from PDFs. It works well for PDFs with complex layouts and tables. |
| PyMuPDF | PyMuPDF (also known as fitz) is another robust library that can extract text and images from PDFs. It's often more efficient and accurate than PyPDF2 when handling complex PDFs. |
| pdfminer.six | pdfminer.six is a Python 3-compatible version of pdfminer, a library for extracting information from PDFs. It is particularly useful if you need to extract detailed information from complex PDFs or perform fine-grained analysis. |

➡️ Build the first iteration with pdfplumber
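As a first sketch of what the pdfplumber iteration could look like, the snippet below pulls both raw text (for BL-4.0/BL-5.0) and tables from each page. The file name "statement.pdf" is a placeholder, and the page/table handling is an assumption until the real layout is tested:

```python
from pathlib import Path

def extract_page_data(page):
    """Pull raw text and tables from one page.

    `page` only needs to expose extract_text() and extract_tables(),
    which is exactly what pdfplumber page objects provide.
    """
    text = page.extract_text() or ""
    tables = page.extract_tables() or []  # each table is a list of row lists
    return text, tables

PDF_PATH = Path("statement.pdf")  # placeholder name for a Custodian Statement

if PDF_PATH.exists():
    import pdfplumber  # pip install pdfplumber
    with pdfplumber.open(PDF_PATH) as pdf:
        for page in pdf.pages:
            text, tables = extract_page_data(page)
            print(text[:80])          # raw text, for the account number / date
            for table in tables:
                print(table[0])       # header row of each detected table
```

Keeping the extraction logic in a small function like `extract_page_data` also makes it easy to unit-test with stub pages before pointing it at real statements.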


Step 3: Decide where extracted PDF data is written to

I am more familiar with using Excel and CSV files to analyse data. So while researching and figuring out how to set up a database, I will write the PDF data to a CSV file.

There should be three CSVs:

CSV 1: Fund Holdings
CSV 2: Securities Holdings
CSV 3: Cash Holdings

Writing to a CSV file means I can easily view and analyse the PDF-extracted data as I build the script.
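Writing the extracted rows out only needs the standard library. Below is a minimal sketch for CSV 1 (Fund Holdings); the file name, column names, and sample row are placeholders I made up, not the real statement data:

```python
import csv

# Hypothetical rows, shaped like the "Unit trust holdings" table (BL-1.0)
fund_rows = [
    {"fund_name": "Example Global Fund", "units": "1,000.00", "current_value": "1,234.56"},
]

# newline="" is required so the csv module controls line endings itself
with open("fund_holdings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["fund_name", "units", "current_value"])
    writer.writeheader()
    writer.writerows(fund_rows)
```

The same pattern, with different field names, would cover CSV 2 (Securities Holdings) and CSV 3 (Cash Holdings).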

Once I am confident the script does what it should, I can customise it to write the data to either a local database or a cloud-based database.

A CSV could also be a storage solution. However, I think CSVs are less efficient at handling larger datasets and doing complex computations. Security could also be an issue, as the files can be easily copied or stolen.


Step 4: Anticipate future challenges

  1. How "change-proof" is the PDF data extraction solution?

It's prudent to assume the Custodian Statement structure will change down the line. The PDF data extraction logic will need to be updated when this happens.

This is not an immediate concern to get bogged down with, as at least a year's worth of PDFs share the same format.

  2. Does the date in BL-5.0 need to be transformed?

The current date format in BL-5.0 is in "mmm-yyyy" format.

Based on my experience using dates in Excel, it's better to store dates in a sortable numerical format such as "yyyy-mm". This makes the data easier to sort and query.

I think this logic also applies when the data is stored in a database. Something to consider during the build phase.
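The transformation itself is a one-liner with the standard library: parse the "mmm-yyyy" string and reformat it as "yyyy-mm", which sorts correctly even as plain text.

```python
from datetime import datetime

def to_sortable(date_str: str) -> str:
    """Convert 'Jan-2024' (mmm-yyyy) to '2024-01' (yyyy-mm)."""
    # %b matches abbreviated month names like 'Jan'; %Y is the 4-digit year
    return datetime.strptime(date_str, "%b-%Y").strftime("%Y-%m")

print(to_sortable("Jan-2024"))  # 2024-01
```

Note that `%b` is locale-dependent, so this assumes the statements use English month abbreviations.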

--Ends
