Introduction
Sometimes we need to get information from a website as soon as possible after it is published, and manually visiting the site every hour to check whether it is there yet is just not practical, especially if you are a programmer and can automate the check.
In my case, I need to know the date when new appointment slots open to complete some paperwork at the Consulate of Spain in Buenos Aires.
They usually announce the date on very short notice (or none at all), and the available slots fill up
before I even realize they were open in the first place.
So, I implemented a simple yet effective scraper that visits the website every hour and parses the
information. If a new date for the slots is found, it notifies me via Gmail and Telegram, so I don't miss it.
NOTE: The full code can be found in this GitHub repo, so in this post I'll only share code snippets,
leaving out features that matter in practice but are not essential to understand the key concept, and that can also get in the
way of it on a first read (e.g., file path resolution, error handling, logging, tests, etc.). If you want the program with all those features, do check the repo.
How to Implement the Scraper
The website is plain HTML, with no need to render dynamic content or provide credentials of any kind. The objective
is to scrape a table element, looking for the information contained in a single row that we can identify by the content
of its first column.
Therefore, there is no need for scraping frameworks like Scrapy, which are awesome but, for this particular case,
complete overkill. Simply using the Python requests
library to get the response will suffice, and parsing
the information with the equally amazing Beautiful Soup will achieve the results beautifully (pun
intended).
To get the response and, specifically, the target row, we just need to use a few lines of code:
import requests
from bs4 import BeautifulSoup

# URL of the consulate's appointments page (placeholder).
url = "<url>"

# Fetch the page and parse the HTML.
req = requests.get(url, timeout=10)
soup = BeautifulSoup(req.text, features="html.parser")

# Get the target row from the table, identified by the content of its first column.
target_row = soup.find(
    lambda tag: tag.name == "tr" and "Registro Civil-Nacimientos" in tag.text
)
target_cells = target_row.find_all("td")

# Parse the information into a dictionary.
result = {
    "servicio": target_cells[0].text,
    "ultima_apertura": target_cells[1].text,
    "proxima_apertura": target_cells[2].text,
    "solicitud": target_cells[3].a.get("href"),
}
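For reference, here is a purely hypothetical example of what result could look like while no new date has been announced; apart from "Registro Civil-Nacimientos" and "fecha por confirmar" (the strings the scraper keys on), the values are placeholders of my own, not real data from the page.

# Hypothetical contents of the parsed dictionary (placeholder values only).
example_result = {
    "servicio": "Registro Civil-Nacimientos",
    "ultima_apertura": "<date of the last slot opening>",
    "proxima_apertura": "fecha por confirmar",
    "solicitud": "<link to the request form>",
}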
That's it: now we have the information. Most tutorials I've found online for simple scrapers end here, but this is still
completely useless (like those tutorials) unless we actually do something with that data.
Set up Notifications and Deploy to Linux VM in GCP
Add notifications
To make it actually useful in practice, we need to add notifications; in my case, via Gmail and Telegram (the full notification code is in the repo).
We need to implement the following simple logic:
# Code to scrape the website.
...
if result["proxima_apertura"] != "fecha por confirmar":
    # Code to send the email.
    ...
Of course, in your case you might need to adapt the code to the result you expect. Also, instead of checking a simple inequality, you might want to use a regex or something more flexible, but that is beyond the scope of this post.
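To give a more concrete idea of what could go behind the "send the email" placeholder, here is a minimal sketch that sends the alert via Gmail (over SMTP, authenticating with an app password) and via Telegram (through the Bot API). The send_notifications helper, the environment variable names, and the message format are assumptions of mine for illustration, not the exact code from the repo:

import os
import smtplib
from email.message import EmailMessage

import requests


def send_notifications(subject: str, body: str) -> None:
    """Illustrative helper: send the alert via Gmail and Telegram."""
    # Gmail over SMTP, using an app password stored in environment variables.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = os.environ["GMAIL_USER"]
    msg["To"] = os.environ["GMAIL_USER"]
    msg.set_content(body)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(os.environ["GMAIL_USER"], os.environ["GMAIL_APP_PASSWORD"])
        server.send_message(msg)

    # Telegram: a plain POST to the Bot API's sendMessage endpoint.
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data={"chat_id": os.environ["TELEGRAM_CHAT_ID"], "text": f"{subject}\n{body}"},
        timeout=10,
    )

With something like that in place, the branch above simply calls send_notifications() with whatever message you find useful.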
Deploy to VM
Finally, we can deploy it to a VM so it can run periodically in the cloud. To see how to do that, you can check my other post:
- [How to deploy your program to run periodically in a Linux VM in GCP Compute Engine][hot-to-deploy]
Once that is done, make sure that the cron job you set up to run your scraper is not aggressive towards the target website: there is no need to put strain on their servers and force them to take measures against your scraper. Running it every hour or every 30 minutes is, in my case, more than enough, and it won't cause the website any stress.
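For example, assuming the scraper's entry point lives at a placeholder path like /opt/scraper/main.py, a crontab entry that runs it at the start of every hour could look like this:

0 * * * * /usr/bin/python3 /opt/scraper/main.py

Changing the schedule field to 0,30 * * * * would run it every 30 minutes instead.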
Conclusion
And done! I showed you how to set up a Python scraper to automate the task of checking a website for you, and notify you whenever a desired change happens.
Do check my GitHub repo to see the full code with logging, error handling, etc.
And as usual, feel free to ask any questions in the comments and I'll check them out when I can.
Cheers!