A cron job is a way to run a program on a server at a specified interval. This is often a small script that does some repeatable task, like collect metrics, check the health of a service, take a snapshot of some dataset, etc.
The cron utility uses a highly customizable configuration format, the crontab file, which allows us to run essentially any command at whatever interval we want. You can have your script run as often as once a minute, or as infrequently as once a year.
Most often, a cron job takes some input, does a little processing, and then generates some output. The input could be a web page, another application, a database, etc. The output might be anything from adding a row to a table in a database, to putting a file in storage, to sending a message.
Cron jobs have a lot of uses: essentially anything you want to happen in a repeatable way can be made into a cron job.
One thing I like to use cron jobs for is to keep an eye on websites that I visit infrequently. For example, I'd like to see articles posted to Hacker News about Python, but I don't have time to check it every day and scan every post title to see if it's about Python.
Instead, I can write a cron job to do this for me. The steps will be roughly as follows:
- Once a day, scrape https://news.ycombinator.com/front
- Extract all the post titles
- If any match "python", send me an email with links
Here's the script:
```python
#!/usr/bin/python
import re

import requests
from bs4 import BeautifulSoup

# Make a request to the page
resp = requests.get("http://news.ycombinator.com/front")

# Parse the HTML
soup = BeautifulSoup(resp.content, 'html.parser')

# Filter out story links that match 'python'
links = [
    link for link in soup.find_all("a", class_="storylink")
    if re.search('python', link.text, re.IGNORECASE)
]

# If there are any, send an email
if links:
    send_email(links)
```
This makes a request to the page, parses the HTML with Beautiful Soup, filters out links that have "python" in the title, and sends me an email (we'll leave that function as an exercise to the reader).
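If you do want a starting point for that exercise, here's a minimal sketch of a `send_email` helper using the standard library's `smtplib`. The SMTP host and the sender/recipient addresses are assumptions for illustration, not part of the original script:

```python
# A minimal sketch of the send_email helper, assuming a local SMTP relay
# on localhost and hypothetical sender/recipient addresses.
import smtplib
from email.message import EmailMessage

def format_body(links):
    # One "title: url" line per matching Beautiful Soup link tag
    return "\n".join(f"{link.text}: {link.get('href', '')}" for link in links)

def send_email(links):
    msg = EmailMessage()
    msg["Subject"] = f"{len(links)} new Python posts on Hacker News"
    msg["From"] = "cron@example.com"   # hypothetical sender
    msg["To"] = "me@example.com"       # hypothetical recipient
    msg.set_content(format_body(links))
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)
```

Any delivery mechanism works here; the only contract is that it accepts the list of matching link tags.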
If I wanted to set this up as a cron job on my Linux machine, I would save it to a file (send_me_pythons.py), make it executable (chmod 755 send_me_pythons.py), and put it in my local bin (/usr/local/bin). Then, I would edit my crontab (/etc/crontab) and add the following line:
0 0 * * * /usr/local/bin/send_me_pythons.py
This runs the script once a day at midnight. I like to use https://crontab.guru/ to ensure that my crontab is right before I set it. Here's a diagram of each field:
```
# 0 0 * * * /usr/local/bin/send_me_pythons.py
# │ │ │ │ │ └── run this script
# │ │ │ │ └──── every day of the week
# │ │ │ └────── every month
# │ │ └──────── every day of the month
# │ └────────── at hour zero
# └──────────── at minute zero
```
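crontab.guru is the easy way to check a schedule; if you'd rather sanity-check one programmatically, here's a minimal sketch. It is deliberately not a full crontab parser — it only accepts `*` or a single integer per field, ignoring ranges, steps, lists, and names:

```python
# Sanity-check a five-field cron expression. A sketch only: each field
# must be '*' or a bare integer within range; ranges, steps, lists, and
# month/day names are not handled.

FIELD_RANGES = [
    (0, 59),  # minute
    (0, 23),  # hour
    (1, 31),  # day of month
    (1, 12),  # month
    (0, 6),   # day of week
]

def looks_like_cron(expr):
    fields = expr.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        field == "*" or (field.isdigit() and lo <= int(field) <= hi)
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
    )
```

For anything fancier than bare numbers, defer to crontab.guru or cron itself.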
Once the crontab is set, my machine will pick up on the new cron job and run it as often as I've specified.
This is great, but it has one huge downside: now I have to keep my server up and running to ensure the emails get sent, or I'll miss valuable Python content. Worse, I either have to share this server with other workloads, or keep an entire machine running just to execute this little script once a day. Not ideal.
Instead, let's take this function to the cloud. First, we'll turn it into a Google Cloud Function. We'll wrap our entire existing script in a Python function, and put it in a file called main.py:
```python
import re

import requests
from bs4 import BeautifulSoup

def send_pythons(request):
    # Make a request to the page
    resp = requests.get("http://news.ycombinator.com/front")

    # Parse the HTML
    soup = BeautifulSoup(resp.content, 'html.parser')

    # Filter out story links that match 'python'
    links = [
        link for link in soup.find_all("a", class_="storylink")
        if re.search('python', link.text, re.IGNORECASE)
    ]

    # If there are any, send an email
    if links:
        send_email(links)
```
Note: Only the lines that need to happen every time the function is called actually need to be inside the send_pythons function. The import statements are executed once when the function is loaded, so they can live outside the function.
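The same idea applies to any object that's expensive to build: create it once at module load and reuse it across invocations. For instance (a sketch, not part of the original), a shared requests.Session keeps connections warm between calls handled by the same function instance:

```python
import requests

# Created once when the function instance loads, then reused on every
# invocation handled by this instance.
session = requests.Session()

def send_pythons(request):
    # Reuses the module-level session instead of opening a new
    # connection on each call.
    resp = session.get("http://news.ycombinator.com/front")
    # ... parse, filter, and email as before ...
    return "OK"
```

This is a common cold-start optimization for functions-as-a-service platforms: module-level state survives across invocations on a warm instance, but you shouldn't rely on it persisting forever.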
Next, we'll define our dependencies. We're using requests and beautifulsoup4, so we'll need to list them in our requirements.txt file.
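Assuming nothing beyond what the script imports, the requirements.txt is just two lines:

```
requests
beautifulsoup4
```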
Then, we can deploy this with the gcloud command-line tool:
```
$ gcloud beta functions deploy send_pythons --runtime python37 --trigger-http
```
This will give us an endpoint, something like:

https://us-central1-dont-click-this.cloudfunctions.net/send_pythons

And making an HTTP GET request to that endpoint will result in our function being run.
Now that we've got our script as a function in the cloud, all we need to do is schedule it. We can create a new Google Cloud Scheduler job with the gcloud command-line tool:
```
$ gcloud beta scheduler jobs create http send_pythons_job \
    --schedule="0 0 * * *" \
    --uri=https://us-central1-dont-click-this.cloudfunctions.net/send_pythons
```
This specifies a name for our job, send_pythons_job, which must be unique per project. It also specifies the crontab schedule we used above, and points the job at the HTTP function we created.
We can list our job:
```
$ gcloud beta scheduler jobs list --project=$PROJECT
ID                LOCATION     SCHEDULE (TZ)        TARGET_TYPE  STATE
send_pythons_job  us-central1  0 0 * * * (Etc/UTC)  HTTP         ENABLED
```
And if we want to run our job outside of its normal schedule, we can kick it off from the command line:
$ gcloud beta scheduler jobs run send_pythons_job
There's lots more you can do with Cloud Functions + Cloud Scheduler! Follow the links below to learn how to:
- Schedule more complex tasks with Cloud Scheduler
- Use Cloud Scheduler and Pub/Sub to trigger a Cloud Function
- Send email from App Engine with Mailjet or SendGrid
All code © Google w/ Apache 2 license