DEV Community

I wrote a crawler for the first time.

Kayla Sween ・6 min read

Early on in the pandemic, I decided that I wanted a way to track the moving average of cases per day in my state, Mississippi, since that wasn't something our Department of Health had a graph for at the time. Since I thought, "you know, this won't be too long... I could definitely just do this for a few months," I had been manually adding data for every single day until the end of January. I would frequently forget or just not want to look at the data for a month or more at a time. I realized I needed to find a way to automate this process so I didn't have to go back through the last month's data to update my graph. So, I decided to finally write a crawler to get all that data from our state’s Department of Health website without even thinking about it.

The Crawler

For me, this was the easy part. I wanted to write a web crawler in a language I was comfortable with to get it up relatively quickly, so I decided on JavaScript. I took bits and pieces from various tutorials I had found and decided on using Axios to grab the data and Cheerio to parse it.

To start out, I added Axios and Cheerio to my site.

for yarn: yarn add axios cheerio
for npm: npm install axios cheerio

Then, I included them in the JavaScript file I used for my crawler code.

const axios = require('axios')
const cheerio = require('cheerio')

You could also do it the ✨ES6 way✨:

import axios from 'axios'
import * as cheerio from 'cheerio' // namespace import; recent Cheerio releases have no default export

I also included my JSON file and Node's fs (file system) module so I could add the newest data to that JSON file.

const fs = require('fs')
const data = require('../src/constants/covidData.json')

Then, I created a function to get the latest cases for the day off of the MSDH website. I fetched the data with Axios, loaded it into Cheerio, and then pulled the value out of the section of the DOM that contained the current day's data. I found this selector by going into the dev tools in the browser and looking for the section of the page that contained the daily case data. In this case, there was a data-description attribute on a p tag that helped me locate the correct HTML element. I removed any commas from the string it returned and made sure that it was getting saved as an integer so it would work with my charts.

const msdh = 'https://msdh.ms.gov/msdhsite/_static/14,0,420.html'
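const getDailyCases = async () => {
  try {
    // Fetch the page HTML and load it into Cheerio for jQuery-style querying
    const { data } = await axios.get(msdh)
    const $ = cheerio.load(data)
    // Strip thousands separators, then parse the count as a base-10 integer
    const dailyCases = parseInt($('[data-description="New cases"]').text().replace(/,/g, ''), 10)

    return dailyCases
  } catch (error) {
    console.log(error)
  }
}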
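
One thing to watch for: if the page layout changes and the selector matches nothing, `.text()` returns an empty string and `parseInt` returns `NaN`. Below is a minimal sketch of the parsing step as a standalone helper that makes that case explicit. `parseCount` is a hypothetical name used for illustration and is not part of the original crawler.

```javascript
// Hypothetical helper (not in the original crawler): turns a string such as
// "1,234" into the integer 1234, or returns null when no number is present.
const parseCount = (text) => {
  const cleaned = String(text).replace(/,/g, '').trim()
  const value = parseInt(cleaned, 10) // pass the radix explicitly
  return Number.isNaN(value) ? null : value
}

console.log(parseCount('1,234')) // 1234
console.log(parseCount(''))      // null
```

Returning `null` instead of `NaN` makes it easier for the calling code to decide whether to skip the day's entry.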

I created a new Date object. Since all the data is from the previous day, I set the date to the day before.

let today = new Date()
today.setDate(today.getDate() - 1)

Then I initialized the object that would hold those two pieces of information (the case count and the date) before it gets added to my JSON file.

let dailyCases = {
    newCases: 0,
    date: today.getFullYear() + '-' + today.getMonth() + '-' + today.getDate() // formatting date to match what I needed; note that getMonth() is zero-based (January is 0)
}
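
Because `getMonth()` is zero-based, the snippet above writes January 15, 2021 as `2021-0-15`. If calendar month numbers are what you need, a sketch along these lines may help. The helper names `dayBefore` and `formatDate` are hypothetical, and the unpadded year-month-day format is an assumption based on the string built above.

```javascript
// Hypothetical helpers for illustration; the names are not from the original post.

// Returns a new Date one day before the given one. setDate() handles
// month and year rollover, so the 1st of a month becomes the last day
// of the previous month.
const dayBefore = (date) => {
  const copy = new Date(date.getTime())
  copy.setDate(copy.getDate() - 1)
  return copy
}

// Formats a Date as year-month-day using local time. getMonth() is
// zero-based (January is 0), so 1 is added to get the calendar month.
const formatDate = (date) =>
  date.getFullYear() + '-' + (date.getMonth() + 1) + '-' + date.getDate()

// March 1, 2021 (month index 2) minus one day is February 28, 2021
console.log(formatDate(dayBefore(new Date(2021, 2, 1)))) // 2021-2-28
```

Copying the date before changing it keeps the caller's original `Date` object untouched.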

Finally, I wrote another async function to call my getDailyCases function and, after it gets that data, add it to my JSON file as long as there are new cases, and that date doesn't exist in the JSON file.

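const getCovidData = async () => {
  dailyCases.newCases = await getDailyCases()

  // data.data holds objects, so compare each entry's date rather than using includes()
  const dateExists = data.data.some((entry) => entry.date === dailyCases.date)

  // getDailyCases() returns undefined if the request fails, so require a real, non-zero number
  if (!dateExists && Number.isInteger(dailyCases.newCases) && dailyCases.newCases !== 0) {
    data.data.push(dailyCases)

    fs.writeFile('src/constants/covidData.json', JSON.stringify(data), (error) => {
      if (error) {
        console.log(error)
      }
    })
  }
}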

And, of course, call that function so that it'll actually run.

getCovidData()
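
The two conditions in that check (a non-zero case count and a date that is not already stored) can be tested without touching the file system if they live in a small pure function. Here is a sketch, assuming each stored entry has the `{ newCases, date }` shape shown earlier. `addIfNew` is a hypothetical name, not something from the original crawler.

```javascript
// Hypothetical helper: appends `entry` to `entries` only if it reports a
// non-zero integer case count and no existing entry shares its date.
// Returns true when the entry was added.
const addIfNew = (entries, entry) => {
  const hasCases = Number.isInteger(entry.newCases) && entry.newCases !== 0
  const alreadySaved = entries.some((saved) => saved.date === entry.date)
  if (!hasCases || alreadySaved) return false
  entries.push(entry)
  return true
}

const entries = [{ newCases: 120, date: '2021-1-30' }]
console.log(addIfNew(entries, { newCases: 95, date: '2021-1-31' })) // true
console.log(addIfNew(entries, { newCases: 95, date: '2021-1-31' })) // false (duplicate date)
console.log(addIfNew(entries, { newCases: 0, date: '2021-2-1' }))   // false (no new cases)
```

The comparison uses `some` because the array holds objects, and a date string can never be equal to an object.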

That's all there is to the crawler! You can check out the full crawler file on my GitHub.

Getting it to run regularly

My first thought was to use a combination of Netlify functions to run the web crawler and Zapier to schedule the daily deployment. I quickly realized this wasn't going to work. Since my database was just a JSON file in my GitHub repo, I needed to make sure that the data was getting added every day. When I tried using the Netlify/Zapier combination, it would run the crawler and "overwrite" the last entry daily, since that data wasn't getting pushed back to GitHub.

After that didn't pan out, I decided to try GitHub Actions, which I had never used before. (Spoiler, this is what I ended up using.)

I just jumped right into GitHub Actions without any real research or planning. Normally, that's not something I'd recommend. However, it worked out pretty well this time because of how well the default YAML file was commented. I used a lot of the default YAML file for the action.

To get the Action to run daily, I used POSIX cron syntax to set the interval.

on:
  schedule:
    - cron: "00 20 * * *"

Each of those places separated by spaces represents a unit of time, and together they determine how often your Action will run. You'll often see the syntax denoted by five asterisks ("* * * * *"). The first place is the minute (0-59). The second is the hour (0-23, in UTC). The third is the day of the month (1-31). The fourth is the month (1-12 or JAN-DEC). Finally, the fifth is the day of the week (0-6 or SUN-SAT). If you leave any of these as an asterisk, the Action will run for every one of those units of time. I wanted my Action to run every day at 20:00 UTC (2 PM CST) to ensure the Department of Health had time to publish data that day. Therefore, I only put values in the minute and hour places and left the rest as asterisks.
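
To make the field order concrete, here is a small sketch that labels the five parts of a cron expression. `describeCron` is a hypothetical helper written for illustration. It only splits and names the fields, and it does not validate them or reproduce how GitHub Actions parses schedules.

```javascript
// Hypothetical helper: labels the five fields of a cron expression.
// It only splits and names the fields; it does not validate their values.
const describeCron = (expression) => {
  const fields = expression.trim().split(/\s+/)
  if (fields.length !== 5) {
    throw new Error('Expected 5 fields, got ' + fields.length)
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = fields
  return { minute, hour, dayOfMonth, month, dayOfWeek }
}

const schedule = describeCron('00 20 * * *')
console.log(schedule.hour)      // 20
console.log(schedule.dayOfWeek) // *
```

For the schedule used here, only `minute` and `hour` carry values, which is what makes the job run once a day.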

Once I determined how often I needed it to run, I needed to define the actual job (with steps!) that I wanted it to run. So I set up Node.js, installed my dependencies (Axios and Cheerio), ran my crawler, and then pushed the changes to my repository.

jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on (I left it as the default)
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
      
      - name: Setup Node.js environment
        uses: actions/setup-node@v2.1.4
    
      - name: Install axios and cheerio
        run: |
          npm install axios
          npm install cheerio
    
      - name: Get Covid Data
        run: |
          node functions/crawler.js
          
      - name: Push changes
        uses: actions-go/push@v1
        with:
          # The commit message used when changes need to be committed
          commit-message: "running daily COVID data crawler"

That's all there is to it! Now the web crawler is running every day! You can check out the GitHub Action file on my GitHub.

You can also see the final product in action on the COVID-19 page on my website.

Senior-ish developers get intimidated too.

Writing a web crawler was something I put off for a LONG time in my career. It was probably the first thing I was asked to do as a developer (which I didn't). Quite honestly, it intimidated me a lot and took me around 9 years to get over that intimidation. I just assumed that I wouldn't be able to do it, and I let that consume me. Now, every single time I see that commit message "running daily COVID data crawler," I feel so proud. I've built many things throughout my career, but this may be the thing I'm most proud of because I proved to myself that I could do it.

Let this be a lesson for new developers that sometimes things don't get less scary. You just get less afraid of failing.

Illustration from Undraw

Discussion (24)

Functional Javascript

Great work Kayla!

A quick tip. If you use the async-await pattern, you don't need the ".then" pattern.

const getCovidData = async () => {
  try {
    daily.newDeaths = await getDailyDeaths();
    daily.testsRun = await getTestsRun();
    daily.newCases = await getDailyCases();
    daily.totalCases = await getTotalCases();

//....

Muhammad Hasnain

The problem with this approach is that it blocks all the next await chains. Instead, opt for something like this.

const promises = [
  getDailyDeaths(),
  getTestsRun(),
  getDailyCases(),
  getTotalCases(),
];

// If you don't want to use await just go with .then in the line below
const resolvedPromises = await Promise.all(promises);

Functional Javascript

You could do that in certain cases. But there are only 4 of them. Plus by doing it serially to the same domain, you're not overwhelming the one server endpoint in this case.

Actually, as a robustness strategy, with multiple calls to the same domain, you typically want to insert manual delays to prevent server rejections. I'll actually place a "sleep" between calls. It's not a race to get the calls done as fast as possible. The goal is robustness.

Also, this is a background job, so saving a second or two is not a critical performance criterion here. In this case one request isn't dependent on another so it's fine, but if it were, you'd want to run them serially.

Also with Promise.all, if one fails, they all fail. It's less robust. With the serial approach each request is atomic and can succeed on its own, i.e. getting 3 out of 4 successful results is better than getting 0 out of 4 successes, even though some of them succeeded.

Also you now have an array of resolved promises that you still have to loop through and process. With the serial approach, it was done in 4 lines. Much easier to grok. Much easier to read. Much easier to debug.

If I had to do dozens of requests that had no order dependency on each other, and they all went to various domains, and 100% of them had to succeed otherwise all fail, then Promise.all is certainly the way to go. If you have 2 or 3 or 4 requests, there's really no compelling benefit. Default to simple.

So there are pros and cons to each approach to consider.
Thanks for the input!

Muhammad Hasnain

Ahan, thank you for presenting the case in such detail. These things never occurred to me and now I know better. :)

Robloche

The point about robustness is valid but for the sake of it, I'll mention that the potential issue with Promise.all can be avoided by using Promise.allSettled instead.
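
A minimal sketch of that idea follows. The two fetchers are stand-ins that resolve or reject immediately, purely for illustration. In the real crawler they would make HTTP requests, and the function name `collect` is hypothetical.

```javascript
// Stand-in fetchers for illustration only; the real crawler's functions
// would make HTTP requests instead of resolving or rejecting immediately.
const getDailyCases = async () => 250
const getDailyDeaths = async () => {
  throw new Error('selector not found')
}

const collect = async () => {
  // allSettled waits for every promise and never rejects itself, so one
  // failure does not discard the results that did succeed.
  const [cases, deaths] = await Promise.allSettled([getDailyCases(), getDailyDeaths()])
  return {
    newCases: cases.status === 'fulfilled' ? cases.value : null,
    newDeaths: deaths.status === 'fulfilled' ? deaths.value : null,
  }
}

collect().then((daily) => console.log(daily)) // { newCases: 250, newDeaths: null }
```

The requests still start at the same time, so the earlier point about not overwhelming a single server applies here as well.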

Kayla Sween Author

Good to know! Thanks for the tip! I’ll change that in my code!

raddevus

This is a really nice write-up and a great idea.

Are you running Ubuntu or some other Linux version as your desktop OS? Just curious.
I run Ubuntu 20.04 (I switched to Ubuntu about 2 years ago, after using Windows for over 25 years...seriously).
Thanks for sharing this interesting article.

Kayla Sween Author

Thanks! Nope, Ubuntu was just the default option in that YAML file, and I didn’t change it because it worked just fine. I’m an OSX person.

Kevin Hicks

I had to write some crawlers when I first started. It definitely is one of those things you start out having no clue how to do it and thinking it's going to be extremely tough.

I like the simple approach you took to store the data in a JSON file and use Github Actions to update it. It's nice when we can keep things simple instead of spinning up databases and complex infrastructure.

Also, congratulations on getting over the intimidation of doing this. Us senior developers definitely do get afraid of tasks. We need to remember that we can always learn how to do something, just like how we didn't know how to write code before we were developers.

Kayla Sween Author

Definitely! Keeping it simple was really important to me because I didn’t want to spend a long time getting it set up or potentially spending longer maintaining it.

Thanks! Yeah this is one that’s been haunting me for sure, so it gave me a good confidence boost! And that’s very true!

Thanks for reading!

pashacodes

Thank you so much for sharing. I have been really curious about writing cron jobs. It was so helpful to read through your thought process, and it seems a lot less intimidating now.

Kayla Sween Author

Awesome!!! So glad I could help. Thank you for reading!

Médéric Burlet

I used cheerio for a while but realized that it's very slow when processing a lot of pages.
I'd recommend using regex; I cut down processing time by 80%.

Good job though it's a great start!

Kayla Sween Author

Good to know! I’m just collecting data from one page for now, but I’ll definitely keep that in mind. Thank you!

Tom Connors

As you say, I've not done it and thank you for showing how!!

Kayla Sween Author

Thank you for reading! 😊

Michael Hungbo

Amazing tutorial Kayla. And congratulations for getting over the intimidation you had to live with for 9yrs! It's a win and I'm happy you got over it.

Kayla Sween Author

Thank you so much! I’m happy I got over it too. 😅

Rafael

That's awesome! I love your initiative. Also, am glad that cases seem to be going down lately.

Kayla Sween Author

Thanks! Yeah, I’m pretty thankful for that. I believe we have more people vaccinated than have tested positive in Mississippi at this point, so I’m optimistic!