Scraping Wikipedia for data using Puppeteer and Node

#webdev #node #javascript #puppeteer

Intro

In this article, we'll go through scraping a Wikipedia table with COVID-19 data using Puppeteer and Node. The original article that I used for this project is located here.

I have never scraped a website before. I've always seen it as a hacky thing to do. But, after going through this little project, I can see the value of something like this. Data is hard to find and if you can scrape a website for it, in my opinion, by all means, do it.

Setup

Setting up this project was extremely easy. All you have to do is install Puppeteer with the command npm install puppeteer. There was one confusing issue I had during setup, however. The puppeteer package was not unzipped correctly when I initially installed it. I found this out while running the initial example in the article. If you get an error that states Failed to launch browser process or something similar follow these steps:

Unzip chrome-win from node_modules/puppeteer/.local-chromium/
Then add that folder to the win64 folder in that same .local-chromium folder.
Make sure the chrome.exe is in this path node_modules/puppeteer/.local-chromium/win64-818858/chrome-win/chrome.exe
This is for windows specifically. Mac might be similar, but not sure.

Here is the link that lead me to the answer. It might be a good idea to do this no matter what to make sure everything is functioning properly.

The code

I had to make a couple of small changes to the existing code.

First example

The first example didn't work for me. To fix the problem I assigned the async function to a variable then invoked that variable after the function. I'm not sure this is the best way to handle the issue but hey, it works. Here is the code:

const puppeteer = require('puppeteer');

const takeScreenShot = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.stem-effect.com/');
    await page.screenshot({path: 'output.png'});

    await browser.close();
};

takeScreenShot();

Wikipedia scraper

I also had an issue with the Wikipedia scraper code. For some reason, I was getting null values for the country names. This screwed up all of my data in the JSON file I was creating.

Also, the scraper was 'scraping' every table on the Wikipedia page. I didn't want that. I only wanted the first table with the total number of cases and deaths caused by COVID-19. Here is the modified code I used:

const puppeteer = require('puppeteer');
const fs = require('fs')

const scrape = async () =>{
    const browser = await puppeteer.launch({headless : false}); //browser initiate
    const page = await browser.newPage();  // opening a new blank page
    await page.goto('https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory', {waitUntil : 'domcontentloaded'}) // navigate to url and wait until page loads completely

    // Selected table by aria-label instead of div id
    const recordList = await page.$$eval('[aria-label="COVID-19 pandemic by country and territory table"] table#thetable tbody tr',(trows)=>{
        let rowList = []    
        trows.forEach(row => {
                let record = {'country' : '','cases' :'', 'death' : '', 'recovered':''}
                record.country = row.querySelector('a').innerText; // (tr < th < a) anchor tag text contains country name
                const tdList = Array.from(row.querySelectorAll('td'), column => column.innerText); // getting textvalue of each column of a row and adding them to a list.
                record.cases = tdList[0];        
                record.death = tdList[1];       
                record.recovered = tdList[2];   
                if(tdList.length >= 3){         
                    rowList.push(record)
                }
            });
        return rowList;
    })
    console.log(recordList)
    // Commented out screen shot here
    // await page.screenshot({ path: 'screenshots/wikipedia.png' }); //screenshot 
    browser.close();

    // Store output
    fs.writeFile('covid-19.json',JSON.stringify(recordList, null, 2),(err)=>{
        if(err){console.log(err)}
        else{console.log('Saved Successfully!')}
    })
};
scrape();

I wrote comments on the subtle changes I made, but I'll also explain them here.

First, instead of identifying the table I wanted to use by the div#covid19-container, I pinpointed the table with the aria-label. This was a little more precise. Originally, the reason the code was scraping over all of the tables on the page was because the IDs were the same (I know, not a good practice. That's what classes are for, right?). Identifying the table via aria-label helped ensure that I only scraped the exact table I wanted, at least in this scenario.

Second, I commented out the screenshot command. It broke the code for some reason and I didn't see the need for it if we were just trying to create a JSON object from table data.

Lastly, after I obtained the data from the correct table I wanted to actually use it in a chart. I created an HTML file and displayed the data using Google charts. You can see the full project on my Github if you are curious. Fair warning, I got down and dirty (very hacky) putting this part together, but at the end of the day, I just wanted an easier way to consume the data that I had just mined for. There could be a whole separate article on the amount of refactoring that can be done on my HTML page.

Conclusion

This project was really fun. Thank you to the author, Mohit Maithani, for putting it together. It opened my eyes to the world of web scraping and a whole new realm of possibilities! At a high level, web scraping enables you to grab data from anywhere you want.

Like one of my favorite Youtubers, Ben Sullins likes to say, "When you free the data, your mind will follow".

Love y'all. Happy coding!

Top comments (4)

Sailesh Dahal • Jan 6 '21

I did something similar to this.
I made an automated crawler, which takes an entry URL and crawls automatically, and dumps data in the JSON file.
You can check this here