I started working on a data science project with a friend, where we wanted to understand how people feel about a few topics over a period of time. Naturally, the first part of building a project like this involves scraping data from the web. Since the project spans several months, we needed a way to scrape data automatically at regular intervals, without any manual intervention.
If you want to skip the details and jump straight to the code, scroll to the bottom of the page.
The first thing that comes to mind when talking about schedulers is cron. The trusty, reliable, all-purpose cron has been with us for a very long time, and plenty of projects with simpler scheduling requirements still use cron in production. But cron requires a Linux server to run on. We could have set up a serverless container or a small Linux VM for the purpose, but we didn't, because of the alternative we found.
Apache Airflow deserves a post of its own, and I'm planning to write one in the future. Simply put, Airflow is an open-source workflow orchestration tool that helps data engineers make sense of production data pipelines. In today's world, where data fuels so many applications, the reliability and maintainability of Airflow pipelines leave little to be desired. Airflow organizes work into DAGs (Directed Acyclic Graphs), where each node is a task and tasks execute in the order we specify. It's also possible to run Airflow in a distributed setting to scale it up for production workloads. But Airflow is overkill for small projects, and it ends up adding overhead to a fast-paced, lightweight development process.
This is one of the lesser-known features of Google Cloud and the service we used in our project. Cloud Scheduler is Google Cloud's serverless offering of cron. In the free tier, Google lets you schedule 3 jobs at no cost. Note that a job is different from an execution instance, which means you can schedule 3 processes on any cron schedule, each running as often as you like. We wanted our process to run every hour, and hence the trusty schedule of `0 * * * *` (more info here) went onto Cloud Scheduler.
There is an excellent Python package for scraping Twitter that can be found here. I used it to extract data for the hashtags we wanted to track.
Google Cloud Functions are a serverless compute offering by Google. A variety of tasks can be performed with these serverless functions, but in this post I'll only talk about how I used them to scrape data from Twitter.
This is what happens inside the cloud function:
- The keyword list is downloaded from a GCS bucket
- The keyword list is used to scrape matching tweets from Twitter
- The scraped tweets are written back to GCS as a CSV file
I made my script compatible with the requirements of Cloud Functions, set the trigger to an HTTP call, and added some code to store data in a Google Cloud Storage bucket. That was it! The serverless function was ready to do some web scraping.
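The steps above can be sketched roughly as follows. The GCS calls and the scraper are stubbed out as hypothetical helpers (`download_keywords`, `scrape_hashtag`, `upload_csv`), since those details depend on your bucket layout and scraping package; only the in-memory CSV serialization is shown concretely.

```python
import csv
import io


def tweets_to_csv(tweets, fieldnames=("date", "user", "text")):
    """Serialize a list of tweet dicts into a CSV string held in memory,
    ready to be uploaded to a GCS bucket as a single object."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fieldnames), extrasaction="ignore")
    writer.writeheader()
    writer.writerows(tweets)
    return buf.getvalue()


def scrape(request):
    """HTTP entry point for the Cloud Function.

    The helpers below are placeholders for the GCS and scraping logic:
      - download_keywords(bucket): read the keyword list from GCS
      - scrape_hashtag(keyword):   collect tweets for one hashtag
      - upload_csv(bucket, data):  write the CSV back to GCS
    """
    keywords = download_keywords("my-bucket")       # hypothetical helper
    tweets = [t for kw in keywords for t in scrape_hashtag(kw)]
    upload_csv("my-bucket", tweets_to_csv(tweets))  # hypothetical helper
    return "ok", 200
```

Keeping the CSV in memory (rather than writing to local disk) fits the Cloud Functions model, where only `/tmp` is writable and the instance is ephemeral anyway.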
Now that I had a serverless scheduler and a cloud function, all I had to do was connect the two. This was fairly simple: Cloud Scheduler lets us send a POST request at whatever interval we want. All we needed to do was give it the right permissions to invoke the cloud function (cloud functions should not be publicly invokable). And there we had it: a completely serverless scraper that scrapes the web on a regular schedule and lets us build excellent datasets for our use case at zero expense.
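The wiring might look something like the commands below. The function name, job name, region, project ID, and service-account name are all placeholders, not the ones from our project:

```shell
# Create a dedicated service account for Cloud Scheduler to authenticate as
gcloud iam service-accounts create scheduler-invoker

# Grant that service account permission to invoke the (non-public) function
gcloud functions add-iam-policy-binding scrape_tweets \
    --region=us-central1 \
    --member="serviceAccount:scheduler-invoker@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudfunctions.invoker"

# Schedule an authenticated POST to the function every hour, on the hour
gcloud scheduler jobs create http hourly-scrape \
    --schedule="0 * * * *" \
    --uri="https://us-central1-my-project.cloudfunctions.net/scrape_tweets" \
    --http-method=POST \
    --oidc-service-account-email="scheduler-invoker@my-project.iam.gserviceaccount.com"
```

The `--oidc-service-account-email` flag is what makes the request authenticated, so the function can stay private while still being callable by the scheduler.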
Here is the gist of the source code. If you found this post useful, consider following me here as I plan to write more about data engineering and data science in the upcoming weeks.