TL;DR: use IMPORTHTML() and IMPORTXML() to webscrape :)
Did you know you can webscrape using Google Sheets? Me neither, but it's the year 2022 and we have 9 more months in this year to learn so many new random things.
Bob's your uncle
I enjoy reading about personal finance, and one person whose posts I enjoy is Bob at Tawcan.com. He has a post detailing how he uses IMPORTHTML() and IMPORTXML() to scrape dividend data from Google Finance and Yahoo Finance. This is really useful since the built-in GOOGLEFINANCE() function doesn't have this information. I ended up modifying what he had and also had to learn about XPath and XQuery along the way.
Here is one example of a cell function he uses to get the dividend and yield information from Yahoo Finance:
=SPLIT(INDEX(IMPORTHTML(concatenate("https://finance.yahoo.com/quote/",B2),"table",2),6,2)," ")
Using IMPORTHTML() you only get a choice of scraping for "table" or "list". In the above example, B2 contains a stock ticker symbol like RY, which is concatenated with the URL of the page we want scraped. IMPORTHTML() then pulls in the 2nd table on that page. Sheets imports the whole table, which isn't very useful on its own, so we further parse it with INDEX() to get the exact cell we want. In this case the dividend and yield information sits in the 6th row and 2nd column, with the index starting from 1. That cell contains both the dividend and the yield, so we SPLIT() it to separate them.
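To see what each layer is doing, it helps to evaluate the pieces separately. The values below are made up for illustration (Yahoo's actual table contents change), with RY substituted in for the ticker in B2:
=IMPORTHTML("https://finance.yahoo.com/quote/RY", "table", 2)
pulls the quote summary table into a block of cells. If that block happens to start at D1, then
=INDEX(D1:E20, 6, 2)
returns the raw text of the 6th row, 2nd column, something like "1.44 (3.12%)", and
=SPLIT("1.44 (3.12%)", " ")
spills the dividend and the yield into two adjacent cells.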
If you check out Bob's post, you'll see that he constantly needs to update his formulas because Google Finance and Yahoo Finance keep updating their layouts. He also uses IMPORTXML() at some point because the pages are dynamic and don't usually have that information statically available.
My Solution
I decided to find other sources for dividend and yield data and came across DividendHistory.org. What's nice is that this source also has dividend and yield information for ETFs, which isn't the case for Google and Yahoo Finance. I don't know how often the layout is updated, but given how simple the site is and that it serves a single purpose, I assume it won't change too often.
Scraping for Yield
=INDEX(SPLIT(IMPORTXML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "//p[contains(text(), 'Yield')]", "en_US"), ":%"), 1, 2)
In my use case, I like to keep ticker symbols in column A and their exchanges in column B. This matters because Canadian stocks (TSX) are found under payout/TSX/<tickersymbol>, whereas American stocks are under payout/<tickersymbol>. I also don't use CONCATENATE() and prefer to just write "<string>"&A1&"<string>".
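To make the concatenation concrete, here are two made-up rows (the tickers are only examples). With A2 = "RY" and B2 = "TSX", the IF() returns "TSX/" and the request goes to
https://dividendhistory.org/payout/TSX/RY
whereas with A2 = "AAPL" and B2 = "NASDAQ", the IF() returns "" and the request goes to
https://dividendhistory.org/payout/AAPL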
The second argument to IMPORTXML() is the XPath query (xpath_query):
//p[contains(text(), 'Yield')]
Thankfully HTML is basically just a very specific XML, amirite? Anyway, the query looks for any <p> that contains the text value 'Yield' between the tags.
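To see the raw match before any cleanup, you can run the import on its own (the TSX/RY path here is just an example ticker):
=IMPORTXML("https://dividendhistory.org/payout/TSX/RY", "//p[contains(text(), 'Yield')]", "en_US")
That returns the full text content of the matching paragraph, which looks something like this: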
Yield: 1.23%
And since I only want the numbers, I SPLIT() at the : and the % and take the second element (remember, in Sheets the index starts at 1).
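Spelled out on that example string (assuming the paragraph really does read "Yield: 1.23%"):
=SPLIT("Yield: 1.23%", ":%")
splits on each of the two delimiter characters and drops the empty pieces, leaving "Yield" in the first cell and the yield value in the second, and
=INDEX(SPLIT("Yield: 1.23%", ":%"), 1, 2)
picks out that second cell.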
Scraping for Dividend ($)
=INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 2, "us_EN"), 4, 3)
The dividend is easier since I can use IMPORTHTML() and select the exact cell from the 2nd table on the page.
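With the same made-up row as before (A2 = "RY", B2 = "TSX"), the formula expands to
=INDEX(IMPORTHTML("https://dividendhistory.org/payout/TSX/RY", "table", 2, "us_EN"), 4, 3)
i.e. the 4th row, 3rd column of the 2nd table on the payout page, which is the cell that happens to hold the dividend amount with the current layout.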
This isn't perfect because some tickers don't have any news or announcements, as is the case for VGRO.
In this case, the dividend history table is actually the 1st table on the page. We end up with a reference error because I'm looking for a 2nd table that doesn't exist (out of bounds!):
#REF
My first solution to this problem was thinking that maybe all ETFs, or even REITs, don't usually have announcements, so I could track a column for the TYPE of investment and pick table 1 or 2 accordingly. This proved to be an incorrect assumption.
My second solution was to make the first IMPORTHTML() request and, if ISERR(<request>) is true, make a second request looking for the 1st table instead.
=IF(ISERR(G2),INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 1, "us_EN"), 4, 3), G2)
This is really inefficient because it means making up to 2 requests for a single value. I store the first request in column G and call it the "test" column:
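Column G just holds the optimistic table-2 formula from earlier, so in my sheet G2 presumably reads
=INDEX(IMPORTHTML("https://dividendhistory.org/payout/"&IF($B2="NASDAQ", "", "TSX/")&$A2, "table", 2, "us_EN"), 4, 3)
and the cell with the IF(ISERR(G2), ...) formula above is the one that actually gets displayed.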
Anyway, that was a fun little rabbit hole, and I think I'll find many uses for webscraping in Sheets that may or may not involve personal finance!
References
https://www.tawcan.com/using-google-spreadsheet-dividend-investment/#Even_Newer_Formula1
https://support.google.com/docs/answer/3093281?hl=en
https://www.w3schools.com/xml/xpath_syntax.asp