milktea02

Posted on Apr 8, 2022 • Edited on Apr 9, 2022

Google Sheets to track Dividends (How I pack and search the data after webscraping it)

#googlesheets #webscraping #finance #dividends

In my last post I wrote about using Google Sheets to webscrape dividend data from dividendhistory.org.

Today I'm going to share with you my adventures to improve efficiency and how to store data using sheets TRANSPOSE() and FLATTEN() functions and search using MATCH().

TOC

The improved scrape
Packing the data
Searching the data
- When am I going to get paid next?
- What about the other data?

The Improved Scrape

Before, to scrape the dividend history table I was using the IMPORTHTML() function but also noted that some times I may make up to two calls because the web page may only have one table instead of two causing essentially an index out of range error.

After reviewing the page source I found out the table has an id:



<table id='dividend_table' class='table table-striped table-bordered'>
  <thead>
    <tr>
      <th>Ex-Dividend Date</th>
      <th>Payout Date</th>
      <th>Cash Amount</th>
      <th>% Change</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><i>2022-10-29</i></td>
      <td><i>2022-11-26</i></td>
      <td><i>$1.33**</i></td>
      <td><i>unconfirmed/estimated</i></td>
    </tr>
    <tr>
      <td><i>2022-07-30</i></td>
      <td><i>2022-08-26</i></td>
      <td><i>$1.33**</i></td>
      <td><i>unconfirmed/estimated</i></td>
    </tr>   
    <tr>
      <td>2022-04-29</td>
      <td>2022-05-26</td>
      <td>$1.33</td>
      <td></td>
    </tr>
...
...
</table>

d'oh!

I can use IMPORTXML() and just query for this table directly!



=IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*")

And using that query looks like this in Sheets:

You can continue to use the INDEX(range, row, col) function to grab the exact cells you want but for myself I wanted to do a few fun things.

Packing the data

Now, I only really need some rows of data and I don't really care for the % Change column, so we use ARRAY_CONSTRAIN(range, rows, cols)



=ARRAY_CONSTRAIN(IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*"), 6, 3)

I want to further pack the data so that it occupies it's own row or column which can be done using TRANSPOSE() and FLATTEN() and you'll see that order will matter:

We want to apply TRANSPOSE() first then FLATTEN():



=FLATTEN(TRANSPOSE(ARRAY_CONSTRAIN(IMPORTXML("https://dividendhistory.org/payout/TSX/BMO", "//table[@id='dividend_table']/*/*"), 6, 3)))

Which makes it look like the columns are combined to a stack (hehe stack):

Searching the Data

If you ever had to take a course on algorithms and data structures you might remember doing a lot of stuff related to pointers and offsets and this is why I stored the data to fit one column!

When am I going to get paid next?

I can make use of the MATCH(key, range, [search_type]) formula to search for a relative offset.

The search_type is useful because I want to find the next payout date. To find the date that is equal to or greater than today I can set the search_type to -1 and luckily for us the dates are already sorted in descending order:

Sample of what we're going to do with the data:

As of writing, TODAY() is 2022-04-08 and we can see that the relative offset of the next payout date (2022-05-26) when searching from B14 to B18 is 3 from the top, so located in B16



=MATCH(TODAY(), B14:B18, -1)

We can also get the exact row where Payout Date starts at using ROW() (useful if you drag things here and there instead of hardcoding):



=ROW(B13)

So now we know that the next payout date is located in cell B(13 + 3 but how can we get the information from that cell address?

We can use the INDIRECT() function to reference a cell using an address that's in TEXT.

Basically we want a generic way to write:



=INDIRECT("B16")

The cool thing is we can use COLUMN and INDIRECT() also accepts address in RyCx format if we set the optional is_A1_notation flag to FALSE. Again I like using & to build strings in sheets:



=INDIRECT("R" & SUM(B1 + B3) & "C" & COLUMN(), FALSE)

And we can do the same for getting the next Cash Amount by using a different starting offset but same relative offset:



=INDIRECT("R" & SUM(B2 + B3) & "C" & COLUMN(), FALSE)

What about the other data?

In my last post I also scraped the yield. I also did this and with a bit of data massaging also tacked it onto the column as you can see here in my raw data spreadsheet example:

So ultimately I end up making only two calls to dividend history per ticker symbol which is much better than up-to two calls for the dividend amount, and one call for the yield. We also get all the useful data from the page.

Here is an example of what you can do with the raw data and only taking the relevant data:

There's all sorts of stuff you can do for example, taking the frequency and calculating how much dividend you could expect from a single share over a year. You can use IF() to decide if the frequency would be 12 (monthly) or 4 (quarterly).

And here is a link to the example if you would like to make a copy and check out the stuff for yourself. It's been pre-populated with Canadian banks and some American companies that pay dividends.

https://docs.google.com/spreadsheets/d/1CNfWRi7pY5zWcQ1htC5m557oBYS8-YKnMtTuStbhTaQ/edit?usp=drivesdk

Top comments (2)

albert • Feb 18 '24 • Edited

Thank you. I cant believe no one has commented on this in 2 yrs. Excellent tutorial on how you used these little known (to me) functions to wrangle all that great stock and dividend information.

One glitch: NASDAQ listings seem to add one line of information to the table id="news-table"

"Next Earnings: ..."

which does not appear in TSX news table and as a result throws off your table at the bottom of Raw Dividend Data