DEV Community: Jian

#014 | Standardising a multi-currency portfolio: Discovery

Jian — Fri, 21 Feb 2025 05:21:15 +0000

Overview

Having built and tested the Extract, Transforms & Load ("ETL") logic to process the Custodian Statement PDF and yfinance datasets, the building blocks to standardise a multi-currency portfolio to a common Base Currency are now in place.

To recap, the two datasets serve the following purposes:

Custodian Statement PDFs: A portfolio's month-end Securities, Funds, Cash and Miscellaneous ("Security Types") holdings
yfinance: Month-end foreign exchange rates ("FX rates") for selected Target Currencies relative to one Base Currency

The following flowchart visualises the end goal of this part of the MVP:

Portfolio Currency Standardisation Logic

The following logic is applied to each Security Type:

1. Aggregate and sum month-end values by currency
2. Convert each currency total to the Base Currency by dividing it by that currency's FX rate. For example, US$1,000 / 0.25 (MYR/USD rate) = MYR4,000

The values from (2) are then summed to get the Portfolio NAV, which is now standardised to a common Base Currency.
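As a rough illustration of the steps above, the sketch below sums hypothetical holdings by currency and divides each total by its FX rate. The function name, sample holdings and rates are made up for illustration, and the rates assume the Ringgit is the Base Currency (how much Target Currency one Ringgit buys). In the MVP itself this logic is built in Xano, not Python.

```python
from collections import defaultdict

def standardise_to_base(holdings, fx_rates):
    """Sum holdings by currency, then convert each total to the Base Currency.

    holdings: list of (currency, value) tuples for one Security Type.
    fx_rates: Target Currency units bought by one unit of the Base Currency.
    """
    totals = defaultdict(float)
    for currency, value in holdings:
        totals[currency] += value  # Step 1: aggregate by currency
    # Step 2: divide each total by its rate to express it in the Base Currency
    return {ccy: total / fx_rates[ccy] for ccy, total in totals.items()}

# Hypothetical month-end holdings and rates (MYR is the Base Currency)
holdings = [("USD", 600.0), ("USD", 400.0), ("MYR", 2500.0)]
fx_rates = {"USD": 0.25, "MYR": 1.0}

converted = standardise_to_base(holdings, fx_rates)
portfolio_nav = sum(converted.values())
print(converted["USD"])  # 4000.0 (US$1,000 / 0.25)
print(portfolio_nav)     # 6500.0
```

Running the same function once per Security Type and adding the results would give the Portfolio NAV.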

This standardised Portfolio NAV paves the way for other MVP requirements to be fulfilled, such as calculating the Management Fee & Performance Fee.

The currency standardisation logic can be built via the Xano user interface. It helps that all the data is already sitting in the same platform.

As there is some repetitive logic, Xano's Custom Functions can be leveraged to make it (the logic) reusable, much like a Python function.

Building a Custom Function is fairly straightforward. The challenge comes from ensuring the standardisation logic is correctly applied to the right data.

--Ends

#013 | Extract, Transform & Load FX rates: User Acceptance Testing

Jian — Wed, 12 Feb 2025 10:21:31 +0000

Overview

I built a basic Extract, Transform & Load ("ETL") process to store historical foreign currency rate ("FX rate") data for multiple Target Currencies relative to one Base Currency, the Malaysian Ringgit.

To recap, an FX rate tells you how much of a Target Currency can be exchanged for one unit of the Base Currency.

To test the ETL process, I did User Acceptance Testing ("UAT").

Test Parameters

Period: Nov 1, 2024 to Dec 31, 2024
Interval: Daily
Target Currencies: Singapore Dollar ("SGD") & US Dollar ("USD")
Base Currency: Malaysian Ringgit ("MYR")

UAT Results: Pre-Fix

The first round of UAT failed, as I made a critical mistake of inputting the wrong currency order in yfinance's download() function. This returned FX rates with the Malaysian Ringgit as the Target Currency (instead of the Base Currency).

For example, instead of "MYRUSD=X" (how many US Dollars one Malaysian Ringgit buys), I input "USDMYR=X" (how many Malaysian Ringgit one US Dollar buys).

This potentially fatal mistake was overlooked during the Build phase, which I covered in the previous post.

Thankfully, doing UAT meant the mistake could be picked up.

Fixing the Code

I fixed the currency order to retrieve the desired FX rate.

Unfortunately, yfinance returned an error saying the "MYRSGD" currency pair/ticker was invalid. The "SGDMYR" pair, however, is valid.

It appears yfinance does not support Cross Currency Triangulation on every currency permutation. This means the MVP has to handle this transformation.

I reverted to the original currency order input, but added a line to calculate the inverse of the returned FX rate. This transforms the Malaysian Ringgit into a Base Currency (from a Target Currency).
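The inversion itself is a one-line calculation. A minimal sketch, assuming a hypothetical helper name and a made-up quote (the actual script applies this to the yfinance dataframe):

```python
def invert_fx_rate(rate):
    """Turn a 'Ringgit per one Target Currency' quote into 'Target Currency per one Ringgit'."""
    if rate <= 0:
        raise ValueError("FX rate must be positive")
    return 1.0 / rate

# Hypothetical quote: USDMYR=X of 4.0 means one US Dollar buys 4 Ringgit
usd_myr = 4.0
myr_usd = invert_fx_rate(usd_myr)
print(myr_usd)  # 0.25, i.e. one Ringgit buys 0.25 US Dollars
```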

UAT Results: Post-Fix

The UAT was successful after applying a fix, and the data is now correctly transformed and loaded into Xano.

Before fixing the error:

After fixing the error:

--Ends

#012 | Extract, Transform & Load FX rates: Build

Jian — Mon, 03 Feb 2025 07:30:57 +0000

Overview

I wrote a script in Flask Python that runs a simple Extract, Transform & Load ("ETL") process on foreign exchange rate ("FX rate") data from yfinance.

To recap, yfinance is a Python library that sources FX rates from Yahoo Finance.

Background to FX rates

We first need to understand how FX rates are quoted and how they relate to each other, as this influences the data transformation and storage logic.

An FX rate is conventionally quoted in the following format: [Base Currency]/[Target Currency]. This shows how much of the Target Currency one unit of the Base Currency buys.

Using the theory of Cross Currency Triangulation, we don't need to download and store every currency permutation. All we need are the FX rates for one Base Currency from yfinance, and other currency permutations can be indirectly calculated (if needed).

For example, assume we are given these quotes where the Malaysian Ringgit is the base currency:

MYR/USD=0.23 means one Malaysian Ringgit buys 0.23 US Dollars
MYR/SGD=0.30 means one Malaysian Ringgit buys 0.30 Singapore Dollars

We can indirectly calculate the USD/SGD rate (or how many Singapore Dollars one US Dollar buys) using the previously mentioned rates. If we are also given the MYR/HKD rate, we can calculate USD/HKD or SGD/HKD rates.
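To make the triangulation concrete: one US Dollar buys 1 / 0.23 Ringgit, and each Ringgit buys 0.30 Singapore Dollars, so USD/SGD is 0.30 / 0.23. A small sketch using the quotes above (the function name is hypothetical):

```python
def cross_rate(base_to_a, base_to_b):
    """Return how many units of currency B one unit of currency A buys.

    Both inputs are quoted against the same Base Currency,
    e.g. MYR/USD and MYR/SGD.
    """
    return base_to_b / base_to_a

myr_usd = 0.23  # one Ringgit buys 0.23 US Dollars
myr_sgd = 0.30  # one Ringgit buys 0.30 Singapore Dollars

usd_sgd = cross_rate(myr_usd, myr_sgd)
print(round(usd_sgd, 4))  # 1.3043
```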

Database Table Setup

I set up two tables called reference_currencies and yfinance_fx_rates in Xano.

The former stores ISO 4217 currency codes for 179 global currencies. The latter stores yfinance-sourced historical month-end FX rates for a list of selected Target Currencies relative to one Base Currency, the Malaysian Ringgit.

ETL Process: Extract

I used yfinance's download() function to extract the data, specifying the following parameters:

The raw, extracted dataframe of the MYR/USD and MYR/SGD quotes from November and December 2024 looks like this:

Some characteristics:

Date is returned as a dataframe Index, not a column
The dataframe's columns are "Adj Close", "Close", "High", "Low", "Open" and "Volume"
yfinance's FX rate convention is [Target Currency][Base Currency].
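For readers who want to reproduce the Extract step, here is a hedged sketch of how such a download() call could be put together. The helper names and dates are illustrative assumptions, not the script's actual code, and the download itself is left commented out because it needs the third-party yfinance package and network access.

```python
def build_tickers(target_currencies, base_currency):
    """Build Yahoo Finance FX tickers in [Target Currency][Base Currency] order."""
    return [f"{target}{base_currency}=X" for target in target_currencies]

def fetch_fx_rates(target_currencies, base_currency, start, end):
    """Download daily quotes for the given pairs as a dataframe."""
    import yfinance as yf  # imported here so build_tickers works without it
    tickers = build_tickers(target_currencies, base_currency)
    return yf.download(tickers, start=start, end=end, interval="1d")

tickers = build_tickers(["USD", "SGD"], "MYR")
print(tickers)  # ['USDMYR=X', 'SGDMYR=X']

# As far as I know yfinance treats `end` as exclusive, so the day after
# the last wanted date is passed here (an assumption worth checking).
# raw = fetch_fx_rates(["USD", "SGD"], "MYR", start="2024-11-01", end="2025-01-01")
```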

We transform the raw data to get the MVP's desired columns:

date: In "yyyymmdd" format
base_currency: 3-letter ISO 4217 currency code
target_currency: 3-letter ISO 4217 currency code
fx_rate: Extracted from yfinance

ETL Process: Transform

I first dropped irrelevant columns and filtered for the last FX rate quote of each month.

I then used pandas' melt() function to rename and reshape the dataframe. I also converted the Date index into a column.

I split each currency pair into two columns: target_currency & base_currency. I anticipate this makes the data more flexible for data manipulation.

This is the final dataframe, after re-ordering and dropping columns.
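The Transform steps can be sketched roughly as follows. This is not the original script: the sample quotes are invented, the input is assumed to already contain only the "Close" column for each ticker, and the function name is hypothetical.

```python
import pandas as pd

# Hypothetical daily "Close" quotes: Date index, one column per ticker
raw = pd.DataFrame(
    {"USDMYR=X": [4.45, 4.44, 4.47, 4.48], "SGDMYR=X": [3.31, 3.32, 3.29, 3.30]},
    index=pd.to_datetime(["2024-11-28", "2024-11-29", "2024-12-30", "2024-12-31"]),
)
raw.index.name = "Date"

def to_month_end_long(close):
    """Keep the last quote of each month and reshape to one row per date and pair."""
    # Last available quote within each calendar month
    month_end = close.groupby(close.index.to_period("M")).tail(1)
    # Turn the Date index into a column, then melt tickers into rows
    long = month_end.reset_index().melt(
        id_vars="Date", var_name="pair", value_name="fx_rate"
    )
    long["date"] = long["Date"].dt.strftime("%Y%m%d")
    # Tickers follow [Target Currency][Base Currency], e.g. "USDMYR=X"
    long["target_currency"] = long["pair"].str[:3]
    long["base_currency"] = long["pair"].str[3:6]
    return long[["date", "base_currency", "target_currency", "fx_rate"]]

result = to_month_end_long(raw)
print(result)
```

Splitting the pair by position works here because every ticker starts with two 3-letter codes; a differently shaped ticker would need its own handling.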

ETL Process: Load

To load the data into Xano, I created an API endpoint that accepts a JSON of the final dataframe. This process is quite straightforward and closely follows the steps in my earlier post.

The API endpoint has the following business logic: Based on the JSON input, retrieve the Primary Key of the Target Currency & Base Currency (from reference_currencies). If found, store the results in the yfinance_fx_rates table. I added a check that skips duplicate records.

Next Steps

The next step is basic User Acceptance Testing ("UAT") to evaluate the success of the ETL process, before proceeding to the next phase of the MVP.

--Ends

#011 | Extract, Transform & Load FX rates : Discover

Jian — Fri, 24 Jan 2025 08:35:53 +0000

Overview

The requirements for the MVP's foreign exchange rate ("FX rate") data provider are:

Free access
Monthly historical data from at least 2018 onwards
FX rates for these currency pairs: USDMYR, SGDMYR, HKDMYR, GBPMYR and JPYMYR
Data can be easily retrieved

Given the plethora of financial market data providers, I asked ChatGPT for some options based on the MVP's requirements.

Selecting an FX rate provider

ChatGPT's suggestions were as follows:

I decided to prioritise trying Alpha Vantage and Yahoo Finance (via yfinance), as I have heard of these platforms before.

Alpha Vantage is backed by Y Combinator and delivers real-time and historical financial market data via Application Programming Interfaces ("API").

yfinance is a free Python library that can be installed via pip. The library scrapes data from Yahoo Finance's extensive, publicly-available financial database and returns it as a dataframe.

Option 1: Alpha Vantage

This platform offers a lifetime, free usage tier where API consumers can make up to 25 API calls per day. This membership tier is good enough for building MVPs.

The API lets users specify whether to return the results in Comma Separated Value ("CSV") or JSON format. I used Postman to test an API call:

I also did further tests that confirmed the currency pairs relevant to the MVP's needs could be met.

However, one thing I noted is the API does not support customised start and end dates. This potentially makes the API response quite large, and requires additional data transformation handling.

Option 2: yfinance

I used yfinance’s download() function to retrieve a sample of historical month-end FX rates for the USDMYR. The data is returned as a dataframe, making it quite easy to apply data transformation logic.

download() accepts many optional parameters that make it quite easy for users to filter for FX rate data relevant to their needs.

For example, the start and end dates can be customised using the start and end parameters. This flexibility is not supported by the Alpha Vantage API.

I did further tests that confirmed the currency pairs relevant to the MVP's needs could be met.

Next Steps

I decided to get FX rate data from yfinance as it meets the MVP's requirements. Although I am not using Alpha Vantage, it can be a good backup in the event any issues arise with yfinance.

The next step is to write a script in the MVP's Flask Python backend that:

Extracts FX rate data from yfinance
Transforms the data
Loads the data in a Xano database

--Ends

#010 | Why FX rate data is important to the MVP

Jian — Thu, 23 Jan 2025 07:43:45 +0000

The next phase of the project is to automate the retrieval & storage of foreign exchange rate ("FX rate") data. This is used to standardise a portfolio's value to a common currency called the Base Currency.

Before diving head first into how we are going to do this, let's first understand why a Base Currency is needed and how this impacts the MVP's requirements.

The Base Currency serves as a reference point for measuring a portfolio's value when it comprises assets denominated in many currencies. And a portfolio's value is the building block to calculating its returns.

Take for example the BlackRock Global Equity Income Fund ("BlackRock Global"). From the snapshot below (of the Fund's portfolio report), we can see the Fund invests in companies from Asia, Canada, Europe, Taiwan and the United States.

To calculate the portfolio's total value in US Dollar terms, the Fund's assets are standardised to the US Dollar from their respective currencies. The totals are then summed up.

To replicate this capability, the MVP needs FX rate data and functions to do these calculations.

Another point to make is that real-time FX rate data is not needed, as the MVP serves user personas who only need to tabulate month-end valuations.

This makes a difference to the MVP's requirements as real-time market data is often expensive to procure. And I also don't want to overbuild this MVP.

--Ends

#009 | Backend Database: User Acceptance Testing

Jian — Tue, 21 Jan 2025 09:31:26 +0000

Overview

The backend database User Acceptance Testing ("UAT") tests the four Application Programming Interface ("API") endpoints that add valid records to a Xano database.

To recap, the APIs have the following functionality:

Authenticate an API client
Accept a JSON input
Apply basic data checks to skip invalid records & store valid records

I used two external clients to do the UAT: the MVP's Flask Python backend and Postman. I wanted to try Postman as it's been a while since I've used it, and there's no harm refreshing my knowledge.

Test Scenario Coverage

I came up with test scenarios that cover the following:

Authentication: Is each endpoint secure?
Data validation: Does each endpoint correctly handle duplicate records and invalid entities?
Data storage: Does each endpoint store valid records in the correct table?

Sample Data & UAT Results

I re-used the same 10 Custodian Statement PDFs (covering 71 pages) from the previous UAT as the sample dataset.

This approach allowed me to easily validate my results by comparing the current UAT output to the previous UAT output (in Excel), as Xano data can be easily exported to Comma Separated Value ("CSV") format.

The alternative - using a fresh set of sample data - would have been to compare the current output to the Custodian Statement PDFs. This is a really tedious task with no obvious benefit.

I executed 20 UAT test case scenarios in total, and there were fortunately no failures.

External API Client: Flask Python Backend

To call the Xano API endpoint using the Flask Python backend, I used Python's requests library.
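A call of this kind might look like the sketch below. The URL is a placeholder, the Bearer-style Authorization header is an assumption about how the endpoint expects the Auth Token, and the function names are hypothetical; only the "new_records" key comes from the payload format used in this series. The network call is wrapped in its own function so the request can be inspected without sending it.

```python
# Placeholder URL, not a real endpoint
XANO_URL = "https://example.invalid/add_custodian_securities"

def build_request(auth_token, json_data):
    """Return the headers and JSON body for the API call."""
    headers = {"Authorization": f"Bearer {auth_token}"}  # assumed header format
    payload = {"new_records": json_data}
    return headers, payload

def post_records(auth_token, json_data, url=XANO_URL):
    """Send the records and return the decoded JSON response."""
    import requests  # third-party package, imported only when a call is made
    headers, payload = build_request(auth_token, json_data)
    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()  # raises on 4xx/5xx, e.g. 401 Unauthorized
    return response.json()

headers, payload = build_request("my-token", [{"name": "AJI", "date": "20241130"}])
print(headers["Authorization"])     # Bearer my-token
print(len(payload["new_records"]))  # 1
```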

The json_data variable contains the data extracted from the Custodian Statement PDF. The snapshot below is the results summary in my browser after a successful API call is made.

External API Client: Postman

Example 1: Invalid Auth Token

Entering an invalid AuthToken returns a 401 Unauthorized response. This is expected.

Example 2: Skip Duplicate Records

To simulate this test, I first added 234 valid records to the custodian_securities table using the API endpoint.

I then called the API endpoint with the same JSON input. As expected, none of these records were stored as they were all duplicates.

Example 3: Skip records with invalid entities

I created a JSON input with a fictitious entity called "Mickey Mouse". I then called the add_custodian_securities API endpoint.

As expected, a 200 Response Code was returned, with the skipped record listed in the response.

Next Steps

I previously did a flowchart to lay out the MVP's envisioned scope, and have updated it to illustrate what has been completed ("Phase 1") and what I plan to do next ("Phase 2").

--Ends

#008 | Backend Database: Build (Part 2)

Jian — Thu, 16 Jan 2025 01:43:46 +0000

Overview

I previously set up a Xano database - refer to #007 - to store data extracted from the Custodian Statement PDFs.

I have now built Application Programming Interface ("API") endpoints in Xano that:

Authenticate an API client
Accept a JSON input
Apply basic data checks to skip invalid records
Store valid records in the MVP's Xano database

The workflow below recaps what I am currently building:

These API endpoints can be consumed by any external client, such as my Flask Python backend, so long as they have a valid Auth Token.

Making a Xano API call

I called the /add_custodian_securities endpoint, which adds data to the custodian_securities table, with a JSON payload of hypothetical data extracted using my Flask Python backend. This table stores each client's month-end Securities holdings.

The JSON payload has three records, of which the second record is a duplicate of the first record. A duplicate record exists if there is more than one record with the same entity, account number and date combination.

{
  "new_records": [
    {
      "name": "AJI",
      "exchange": "KLSE",
      "quantity": "12800",
      "last_price": "15.40",
      "current_value": 197120,
      "account_number": "A0000000",
      "date": "20241130"
    },
    {
      "name": "AJI",
      "exchange": "KLSE",
      "quantity": "12800",
      "last_price": "15.40",
      "current_value": 197120,
      "account_number": "A0000000",
      "date": "20241130"
    },
    {
      "name": "HLIND",
      "exchange": "KLSE",
      "quantity": "39000",
      "last_price": "15.22",
      "current_value": 593580,
      "account_number": "B0000000",
      "date": "20241130"
    }
  ]
}
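The duplicate rule described above can be expressed in a few lines of Python. This is only a sketch of the idea: the real check is built in Xano, the function name is hypothetical, and treating "name" as the entity is an assumption. It also only compares records within one payload, whereas the endpoint presumably compares against what is already stored.

```python
def split_duplicates(records):
    """Separate first-seen records from duplicates.

    Two records count as duplicates when entity (assumed to be "name"),
    account number and date all match.
    """
    seen = set()
    unique, skipped = [], []
    for record in records:
        key = (record["name"], record["account_number"], record["date"])
        if key in seen:
            skipped.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, skipped

new_records = [
    {"name": "AJI", "account_number": "A0000000", "date": "20241130"},
    {"name": "AJI", "account_number": "A0000000", "date": "20241130"},
    {"name": "HLIND", "account_number": "B0000000", "date": "20241130"},
]
unique, skipped = split_duplicates(new_records)
print(len(unique), len(skipped))  # 2 1
```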

Xano API Builder & custodian_securities table before the API Call

Xano API Builder & custodian_securities table after a successful API Call

The endpoint's authentication mechanism also works, as an error was returned when an expired Auth Token was entered.

Why add data checks?

The quality of the MVP's data output is directly impacted by the quality of the data being ingested.

It is therefore important to have basic data checks to skip over invalid records before data ingestion. The endpoint should also return variables that inform the user of the data processing results.

Next Steps

I now have the building blocks to extract data from a list of PDFs and make an API call to store valid records in their respective Xano tables.

But before I proceed further, I will run some tests to ensure the endpoints are working as expected.

--Ends

#007 | Backend Database: Build (Part 1)

Jian — Fri, 10 Jan 2025 06:35:10 +0000

Overview

I built a database in Xano to store data extracted from the Custodian Statement PDFs. This database replaces the earlier Comma Separated Values ("CSV") file storage format.

I then connected my Flask Python backend to the Xano database using API endpoints, also built in Xano.

I found the Xano learning curve quite steep as A) I was new to it and B) I have very little experience with databases. That said, I managed to achieve my objective with a little trial & error, ChatGPT and help from Xano's tutorial videos.

This article will be split into two parts:

Part 1: Creating database tables in Xano
Part 2: Building Xano API endpoints to execute database queries

Step 1: Tables to store entity attributes ("Reference Tables")

I created Reference Tables to store attributes of clients, securities, funds and miscellaneous financial instruments. This process creates a unique Primary Key for each entity, which can be referenced as Foreign Keys in other tables.

The reference_securities table stores a security's full name, short name, stock code, trading currency and stock exchange. I populated reference_securities with data downloaded (legally) from the Singapore Exchange and Bursa Malaysia websites.

reference_funds stores each fund's full name, country of origin, short name and trading currency.

reference_clients stores each client's account number and name. Additional attributes can be added at a later stage, if there is a need.

Step 2: Tables to store Custodian Statements data

I created four tables to store data extracted from the Custodian Statement PDFs.

These tables are linked using Foreign Keys, which reference the Primary Keys of entities like a client's account, a security or a fund.

The Xano snapshot below summarises the inter-table relationships, along with each table's fields.

Next Steps

With the database tables set up, I now need API endpoints to populate these tables with extracted Custodian Statement data.

The steps I took to build the API endpoints will be outlined in Part 2.

--Ends

#006 | Backend Database : Discovery

Jian — Tue, 17 Dec 2024 09:10:09 +0000

    * Overview

    * Xano: An Easy-To-Use Backend System

    * Building a Table

    * Building an API

    * Next Steps

Overview

I wrote Python code to extract data from Custodian Statement PDFs and output the data into three CSV files.

Comma Separated Value ("CSV") files are a great tool for non-technical users who need to store and perform basic data analysis.

However, a more scalable storage solution is needed to handle larger datasets and complex data queries. Another drawback of a CSV file is that it is a single-user storage format.

Xano: An Easy-To-Use Backend System

I have been experimenting with Xano, a cloud-based low-code backend platform, since coming to know about it from this company's blog.

Xano is great for non-technical users because it offers a user-friendly, low-code user interface ("UI") on top of PostgreSQL. The latter is a popular open-source object-relational database system.

Xano also has a low-code API builder UI that helps you design and build APIs to connect your database to other systems.

After replacing the CSVs as a storage medium with Xano, the MVP's workflow looks like this:

To familiarise myself with Xano, I did the following:

Built a database table
Built a user signup API

Building A Table

Creating a table & adding fields

Xano's UI makes table creation a simple process. Users have the option to either import their own data (via CSV or Airtable) or enter data manually.

Adding a table field and specifying its type is also straightforward. It is also easy to make changes later on.

Generating random data for testing

This is very helpful when you want dummy data to test your application.

Exporting data

Should you wish to migrate your data, Xano lets you export your table data to CSV.

Building an API

An Application Programming Interface ("API") is needed to send the data extracted from Custodian Statement PDFs to Xano for storage purposes. Vice-versa, the data stored in Xano can also be retrieved later on via an API.

Xano has a pretty nifty low-code API builder that makes it easy to set up an API and add functions to determine the endpoint's functionality.

For example, I created a user signup API that lets the Back Office Team create accounts. I added functions to limit the number of created accounts to 3 (to block unauthorised spam accounts from being created) and prevent a signup if the username already exists.

Building the API's functionality

Successful user account creation

Precondition 1: Block creation of more than three user accounts

Precondition 2: Block creation of duplicate user accounts
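For readers who think in code, the two preconditions amount to the following checks. This is a Python sketch of the idea only; the actual logic lives in Xano's low-code builder, and the function name and return format are made up.

```python
MAX_ACCOUNTS = 3  # the account limit mentioned above

def check_signup(existing_usernames, new_username, max_accounts=MAX_ACCOUNTS):
    """Return (allowed, reason) for a signup attempt."""
    # Precondition 1: block creation once the account limit is reached
    if len(existing_usernames) >= max_accounts:
        return False, "account limit reached"
    # Precondition 2: block creation of duplicate user accounts
    if new_username in existing_usernames:
        return False, "username already exists"
    return True, "ok"

print(check_signup(["alice"], "bob"))      # (True, 'ok')
print(check_signup(["alice"], "alice"))    # (False, 'username already exists')
print(check_signup(["a", "b", "c"], "d"))  # (False, 'account limit reached')
```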

Next Steps

I think I now have the basic knowledge to create APIs in Xano to:

Create user accounts
Login to a user account and return an Authentication Token
Store Securities Holdings data in securities_pdf
Store Fund Holdings data in funds_pdf
Store Cash Holdings data in cash_pdf

I also need to write code in my Python backend to send corresponding requests to these Xano APIs.

--Ends

#005 | Automate PDF data extraction: User Acceptance Testing

Jian — Thu, 12 Dec 2024 08:20:28 +0000

Overview

Prior to each feature release, I do User Acceptance Testing ("UAT") to surface bugs and ensure the business logic is correctly translated to code.

I only clear a feature for release after UAT is 100% successful.

My reasoning is simple: you only get one chance to make a good first impression to your end user, and a poor release makes it doubly hard to do so.

Although this is an MVP feature that isn't meant for production release, I thought it'd be good to do some UAT to keep my skills fresh.

Results

Of the 19 UAT scenarios I came up with, one failed because of a change in the Custodian Statement PDF template.

I anticipated this risk during Discovery, but truth be told, I did not expect the issue to crop up so soon.

I will go into the bug fix details later in the article.

Methodology

My UAT process involves using the business logic or feature requirements as a reference to create test scenarios and expected outcomes.

Test scenarios don't need to be complicated. They can be as simple as: "The feature generates a CSV file within 30 seconds".

For the UAT, I processed 71 pages of documents from 10 Custodian Statement PDFs. This should be a sufficiently large sample set.

The expected output is three CSV files containing specific datapoints from the Fund Holdings, Securities Holdings and Cash Holdings sections of the Custodian Statement PDF.

I came up with the following test cases:

CSV 1: Fund Holdings

CSV 2: Securities Holdings

CSV 3: Cash Holdings

Bug Fixing

The one failed test was because the Custodian Statement PDF's template changed slightly in November. More specifically, the values in the "Current Value# 1. Foreign Currency 2. RM Equivalent" column of a Fund Holdings table now have an extra "-\n" prefix.

For example, instead of reading "USD 10,000" in previous PDFs, the value now reads "- USD10,000".

This small change resulted in the following issue:

I consulted ChatGPT on a fix, and it recommended the following scrubbing logic be added to remove the incorrect "-\n" prefix.

# Scrub error prefix
df['Currency'] = df['Currency'].str.replace('[-\n]', '', regex=True)

The scrubbing did the trick and the Fund Holdings CSV output now comes out as expected.
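One thing worth knowing about the pattern above is that the character class '[-\n]' removes every hyphen and newline in the value, not just a leading one. The sketch below uses Python's re module on made-up strings to compare it with an anchored alternative that only strips the prefix; the alternative is a suggestion, not what the script uses.

```python
import re

def scrub_all(value):
    """Remove every '-' and newline, as the character class above does."""
    return re.sub(r"[-\n]", "", value)

def scrub_prefix(value):
    """Remove only a leading '-' and any whitespace that follows it."""
    return re.sub(r"^-\s*", "", value)

print(scrub_all("-\nUSD"))     # USD
print(scrub_prefix("-\nUSD"))  # USD

# The two differ when a hyphen appears inside the value (hypothetical input)
print(scrub_all("-\nA-B"))     # AB
print(scrub_prefix("-\nA-B"))  # A-B
```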

What Next?

I'm now comfortable that the code to extract PDF data is functional. That said, I don't think a CSV file is the best place to store all this data.

While CSV is user friendly (to me), storing data in a database makes it much easier to retrieve and manipulate data as per the end user's requirements.

I have very limited experience in databases. So what I'll do next is Discovery on a database application that I can onboard quickly.

--Ends

#004 | Automate PDF data extraction: Build

Jian — Thu, 05 Dec 2024 07:45:00 +0000

Overview

I wrote a Python script that translates the PDF data extraction business logic into working code.

The script was tested on 71 pages of Custodian Statement PDFs covering a 10-month period (Jan to Oct 2024). Processing the PDFs took about 4 seconds to complete - significantly quicker than doing it manually.

From what I see, the output looks correct and the code did not run into any errors.

Snapshots of the three CSV outputs are shown below. Note that sensitive data has been greyed out.

Snapshot 1: Stock Holdings

Snapshot 2: Fund Holdings

Snapshot 3: Cash Holdings

This workflow shows the broad steps I took to generate the CSV files.

Now, I will elaborate in more detail how I translated the business logic to code in Python.

Step 1: Read PDF documents

I used pdfplumber's open() function.

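import pdfplumber
# Open the PDF file; it is closed automatically when the block ends
with pdfplumber.open(file_path) as pdf:
    pages = pdf.pages  # page objects to loop over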

file_path is a declared variable that tells pdfplumber which file to open.

Step 2.0: Extract & filter tables from each page

The extract_tables() function does the hard work of extracting all tables from each page.

Though I am not really familiar with the underlying logic, I think the function did a pretty good job. For example, the two snapshots below show the extracted table vs. the original (from the PDF)

Snapshot A: Output from VS Code Terminal

Snapshot B: Table in PDF

I then needed to uniquely label each table, so that I could "pick and choose" data from specific tables later on.

The ideal option was to use each table's title. However, determining the title coordinates was beyond my capabilities.

As a workaround, I identified each table by concatenating the headers of the first three columns. For example, the Stock Holdings table in Snapshot B is labeled Stocks/ETFs\nNameExchangeQuantity.

⚠️This approach has a serious drawback - the first three header names do not make all tables sufficiently unique. Fortunately, this only impacts irrelevant tables.
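The labelling workaround can be sketched as below, assuming each extracted table is a list of rows with the header cells in the first row. The function name and the sample data row are hypothetical.

```python
def label_table(table, n_headers=3):
    """Label a table by joining the first few header cells of its first row."""
    header_row = table[0]
    # Empty cells can come back as None, so treat them as empty strings
    return "".join(cell or "" for cell in header_row[:n_headers])

stock_table = [
    ["Stocks/ETFs\nName", "Exchange", "Quantity", "Price", "Current Value"],
    ["AJI", "KLSE", "12800", "15.40", "197120"],  # made-up row
]
label = label_table(stock_table)
print(repr(label))  # 'Stocks/ETFs\nNameExchangeQuantity'
```

Two tables whose first three headers match would get the same label, which is the drawback noted above.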

Step 2.1: Extract, filter & transform non-table text

The specific values I needed - Account Number and Statement Date - were sub-strings in Page 1 of each PDF.

For example, "Account Number M1234567" contains account number "M1234567".

I used Python's re library and got ChatGPT to suggest suitable regular expressions ("regex"). Each regex matches the whole string and captures the desired data in a group, which is read with group(1).

Regex for Statement Date and Account Number strings

regex_date=r'Statement for \b([A-Za-z]{3}-\d{4})\b'
regex_acc_no=r'Account Number ([A-Za-z]\d{7})'

I next transformed the Statement Date into "yyyymmdd" format. This makes it easier to query and sort data.

import calendar
from datetime import datetime

if match_date:
    # Parse the "mmm-yyyy" string (e.g. "Jan-2024") into a datetime
    date_obj = datetime.strptime(match_date.group(1), "%b-%Y")
    # monthrange() returns (weekday of day 1, days in month); [1] is the last day
    last_day = calendar.monthrange(date_obj.year, date_obj.month)[1]
    # Replace day with last day of month
    last_day_of_month = date_obj.replace(day=last_day)
    # Format as "yyyymmdd"
    statement_date = last_day_of_month.strftime("%Y%m%d")

match_date is a variable declared when a string matching the regex is found.
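Putting the regex and the date conversion together, a self-contained version might look like this. The function name and sample text are hypothetical; the two patterns are the ones shown above.

```python
import calendar
import re
from datetime import datetime

REGEX_DATE = r"Statement for \b([A-Za-z]{3}-\d{4})\b"
REGEX_ACC_NO = r"Account Number ([A-Za-z]\d{7})"

def parse_first_page(text):
    """Return (account_number, statement_date as yyyymmdd); None where not found."""
    match_acc = re.search(REGEX_ACC_NO, text)
    match_date = re.search(REGEX_DATE, text)
    account_number = match_acc.group(1) if match_acc else None
    statement_date = None
    if match_date:
        date_obj = datetime.strptime(match_date.group(1), "%b-%Y")
        # Last calendar day of the statement month
        last_day = calendar.monthrange(date_obj.year, date_obj.month)[1]
        statement_date = date_obj.replace(day=last_day).strftime("%Y%m%d")
    return account_number, statement_date

sample_text = "Statement for Jan-2024\nAccount Number M1234567"
print(parse_first_page(sample_text))  # ('M1234567', '20240131')
```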

Step 3: Create tabular data

The hard yards - extracting the relevant datapoints - were pretty much done at this point.

Next, I used pandas' DataFrame() function to create tabular data based on the output in Step 2.0 and Step 2.1. I also used pandas to drop unnecessary columns and rows.

The end result can then be easily written to a CSV or stored in a database.

Step 4: Write data to CSV file

I used a write_to_csv() helper function to write each dataframe to a CSV file.

write_to_csv(df_cash_selected,file_cash_holdings)

df_cash_selected is the Cash Holdings dataframe while file_cash_holdings is the file name of the Cash Holdings CSV.

➡️ I will write the data to a proper database once I have acquired some database know-how.

Next Steps

A working script is now in place to extract table and text data from the Custodian Statement PDF.

Before I proceed further, I will run some tests to see if the script is working as expected.

--Ends

#003 | Automate PDF data extraction: Discovery

Jian — Mon, 02 Dec 2024 06:16:53 +0000

Overview

I first analysed the Custodian Statement PDF to understand its structure and identify datapoints relevant to the Project.

This helps me think of the business logic that needs to be translated to code.

On the coding aspect, I will use ChatGPT to guide me as I have no prior experience writing code to extract PDF data. Taking time to understand ChatGPT's suggestions is as important as getting the right answer, as I also want to improve my coding skills.

I'd also like to see how different the experience of sourcing solutions from ChatGPT is to platforms like Stack Overflow.

Step 1: Analyse PDF structure

The Custodian Statements have three major sections, each containing one or more tables.

In the snapshots below, I annotated the tables (blue), table headers (green) and non-table text (green) relevant to the Project requirements.

Section 1: Unit Trust

Section 2: Stock Holdings

Section 3: Cash Holdings

➡️ The business logic below needs to be translated to code

ID	Business Logic
BL-1.0	Extract Fund Name, Units and Current Value values from "Unit trust holdings as at `Statement Date`" table
BL-2.0	Extract Stocks/ETFs Name, Exchange, Quantity, Price and Current Value from "Stock Holdings as at `Statement Date`" table
BL-3.0	Extract Opening Balance & Closing Balance values from "Cash Account Transactions in the month" table
BL-4.0	Extract the account number "M1234567" from "Account Number M1234567" string.
BL-5.0	Extract the date "Jan-2024" from "Statement for Jan-2024".

Step 2: Identify a suitable Python library

As you can see from the snapshots, the Custodian Statement contains a mixture of tables and text.

I asked ChatGPT a general question on Python libraries suited to PDF data extraction.

What python library can I use to extract PDF data?

Library	Description
PyPDF2	PyPDF2 is a pure-Python library that can extract text from a PDF, but it is relatively simple and doesn't handle more complex PDFs (like those with embedded images or non-standard fonts) as well as other libraries.
pdfplumber	pdfplumber is an excellent library for extracting both text and tables from PDFs. It works well for PDFs with complex layouts and tables.
PyMuPDF	PyMuPDF (also known as fitz) is another robust library that can extract text and images from PDFs. It's often more efficient and accurate than PyPDF2 when handling complex PDFs.
pdfminer.six	pdfminer.six is a Python 3-compatible version of pdfminer, a library for extracting information from PDFs. It is particularly useful if you need to extract detailed information from complex PDFs or perform fine-grained analysis.

➡️ Build the first iteration with pdfplumber

Step 3: Decide where extracted PDF data is written to

I am more familiar with using Excel and CSV files to analyse data. So while researching and figuring out how to set up a database, I will write the PDF data to a CSV file.

There should be three CSVs:

CSV 1: Fund Holdings
CSV 2: Securities Holdings
CSV 3: Cash Holdings

Writing to a CSV file means I can easily view and analyse the PDF-extracted data as I build the script.

Once I am confident the script does what it should, I can customise it to write the data to either a local database or a cloud-based database.

A CSV could also be a storage solution. However, I think CSVs are less efficient at handling larger datasets and doing complex computations. Security could also be an issue, as the files can be easily stolen.

Step 4: Anticipate future challenges

How "change-proof" is the PDF data extraction solution?

It's prudent to assume the Custodian Statement structure will change down the line. The PDF data extraction logic will need to be updated when this happens.

This is not an immediate concern to be bogged down with, as I notice at least a year's worth of PDFs have the same format.

Does the date in BL-5.0 need to be transformed?

The date in BL-5.0 is currently in "mmm-yyyy" format.

Based on my experience using dates in Excel, it's always better if dates are a numerical string. This makes the data easier to sort and query.

I think this logic also applies when the data is stored in a database. Something to consider during the build phase.

--Ends
