Overview
I first analysed the Custodian Statement PDF to understand its structure and identify datapoints relevant to the Project.
This helps me think of the business logic that needs to be translated to code.
On the coding aspect, I will use ChatGPT to guide me as I have no prior experience writing code to extract PDF data. Taking time to understand ChatGPT's suggestions is as important as getting the right answer, as I also want to improve my coding skills.
I'd also like to see how different the experience of sourcing solutions from ChatGPT is to platforms like Stackoverflow.
Step 1: Analyse PDF structure
The Custodian Statements have three major sections, each containing one or more tables.
In the snapshots below, I annotated the tables (blue), table headers (green) and non-table text (green) relevant to the Project requirements.
➡️ The business logic below needs to be translated to code
ID | Business Logic |
---|---|
BL-1.0 | Extract Fund Name, Units and Current Value values from "Unit trust holdings as at Statement Date " table |
BL-2.0 | Extract Stocks/ETFs Name, Exchange, Quantity, Price and Current Value from "Stock Holdings as at Statement Date " table |
BL-3.0 | Extract Opening Balance & Closing Balance values from "Cash Account Transactions in the month" table |
BL-4.0 | Extract the account number "M1234567" from "Account Number M1234567" string. |
BL-5.0 | Extract the date "Jan-2024" from "Statement for Jan-2024". |
Step 2: Identify a suitable python library
As you can see from the snapshots, the Custodian Statement contains a mixture of tables and text.
I asked ChatGPT a general question on python libraries suited to PDF data extraction.
What python library can I use to extract PDF data?
Library | Description |
---|---|
PyPDF2 | PyPDF2 is a pure-Python library that can extract text from a PDF, but it is relatively simple and doesn't handle more complex PDFs (like those with embedded images or non-standard fonts) as well as other libraries. |
pdfplumber | pdfplumber is an excellent library for extracting both text and tables from PDFs. It works well for PDFs with complex layouts and tables. |
PyMuPDF | PyMuPDF (also known as fitz) is another robust library that can extract text and images from PDFs. It's often more efficient and accurate than PyPDF2 when handling complex PDFs. |
pdfminer.six | pdfminer.six is a Python 3-compatible version of pdfminer, a library for extracting information from PDFs. It is particularly useful if you need to extract detailed information from complex PDFs or perform fine-grained analysis. |
➡️ Build the first iteration with pdfplumber
Step 3: Decide where extracted PDF data is written to
I am more familiar with using Excel and CSV files to analyse data. So while researching and figuring out how to set up a database, I will write the PDF data to a CSV file.
There should be three CSVs:
CSV 1: Fund Holdings
CSV 2: Securities Holdings
CSV 3: Cash Holdings
Writing to a CSV file means I can easily view and analyse the PDF-extracted data as I build the script.
Once I am confident the script does what it does, I can customise it to write the data to either a local database or a cloud-based database.
A CSV could be also be a storage solution. However, I think CSVs are less efficient at handling larger datasets and doing complex computations. Security could also be an issue, as the files can be easily stolen.
Step 4: Anticipate future challenges
- How "change-proof" is the PDF data extraction solution?
It's prudent to assume the Custodian Statement structure will change down the line. The PDF data extraction logic will need to be updated when this happens
This is not an immediate concern to be bogged down with, as I notice at least a years' worth of PDFs have the same format
- Does the date in
BL-5.0
need to be transformed?
The current date format in BL-5.0
is in "mmm-yyyy" format.
Based on my experience using dates in Excel, it's always better if dates are a numerical string. This makes the data easier to sort and query.
I think this logic also applies when the data is stored in a database. Something to consider during the build phase.
--Ends
Top comments (0)