I talk a lot about puppeteer in my posts. It’s one of my favorite tools for any kind of web automation, whether that’s web scraping, testing, or just automating tasks. I stumbled upon someone asking for advice on how to do some web scraping and I thought puppeteer was the perfect platform for the job. I was going to point them to my post where I talk about basic web scraping with puppeteer and then I realized I didn’t have one. This post is to remedy that.
When I use puppeteer
As a default, I try to use axios or plain http requests for web scraping. It’s going to be quicker and use a lot fewer resources. The modern web is a very javascript heavy one, though. There is a lot of interaction that has to happen, and that is where I use puppeteer.
If I’m going to a site that uses a lot of ajax (that I can’t/don’t want to just call directly) or does its navigation strictly through javascript, that is where I’m going to use puppeteer. If I want to reduce my chance of being blocked and I’m trying to appear more human-like to the place I’m scraping, I’m going to use puppeteer.
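For contrast, here’s a minimal sketch of that simpler axios default, just grabbing the raw HTML of a page (the url is only a placeholder):
import axios from 'axios';

// No browser involved, so this is fast and light on resources
const response = await axios.get('https://javascriptwebscrapingguy.com');
console.log(response.data); // raw HTML, ready to be parsed however you like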
Code examples
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
This is the basic start up of any puppeteer project: initiate a browser instance and then start a new page. puppeteer.launch has a lot of useful options you can pass to it. The one I use most often in development is headless: false, which makes the browser pop up so I can see what my script is doing. The other one I commonly use is slowMo: 250, which slows things down when I am not sure why my scrape isn’t working like I expect. slowMo accepts a milliseconds value as the parameter and, because it slows down EVERY action, you pretty much always want to be on the lower side. For a list of all options, see here.
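Putting those two options together, a development launch might look something like this (250 ms is just an example value; tune it to taste):
// Development-friendly launch: visible browser, every action slowed down
const browser = await puppeteer.launch({
    headless: false,
    slowMo: 250 // milliseconds of delay added before each puppeteer action
});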
Puppeteer on Ubuntu
puppeteer.launch also accepts an args array. I always run puppeteer on Ubuntu in production, so I use that args option every time. Here is a sample of what I typically use. I’ve written a few articles about getting puppeteer fully installed on Ubuntu: Setting up on 16.04 and Setting up on 18.04.
const pptrArgs: puppeteer.LaunchOptions = {
    headless: true,
    ignoreHTTPSErrors: true,
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-infobars',
        '--window-position=0,0',
        '--ignore-certificate-errors',
        '--ignore-certificate-errors-spki-list',
        '--user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3312.0 Safari/537.36"'
    ]
};

const browser = await puppeteer.launch(pptrArgs);
Puppeteer in scraping
Once I have a page instance ready, I simply navigate to where I want to go. I always try to navigate as directly as I can. For example, rather than landing on a site with puppeteer and pushing a button that takes me to their search section, I’m going to try to navigate directly to their search section.
// Navigate where you want to go
const url = 'https://javascriptwebscrapingguy.com';
await page.goto(url);
$eval is the bread and butter of puppeteer scraping. It runs a function against the first element that matches your selector, which makes it easy to grab attributes or innerHTML.
// Get innerHTML
const title = await page.$eval('title', element => element.innerHTML);
console.log('title', title);
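The same approach works for attributes. For example, pulling the href off the first post title link (the '.entry-title a' selector is just what this site’s markup happens to use):
// Get an attribute instead of innerHTML
const firstHref = await page.$eval('.entry-title a', element => element.getAttribute('href'));
console.log('firstHref', firstHref);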
Puppeteer can click and fully interact with the page just like a normal user. This includes clicking on links or buttons to make things appear.
// click something for navigation or interaction
await page.click('.entry-title');
// Click something and wait for it to complete whatever it's doing
await Promise.all([page.click('.entry-title'), page.waitForNavigation({ waitUntil: 'networkidle2' })]);
If there is any kind of data being loaded after you click, you’ll want to wait until it’s loaded before performing your next action. Using Promise.all with both the click and the wait for the navigation is an easy way to ensure that the page is loaded before you move on. networkidle2 simply waits until there are at most two network connections still active. This is a real lifesaver for websites that keep network connections open, which is a lot more common than it used to be.
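If the click only fires off an ajax request and doesn’t actually navigate, waitForNavigation will just sit there and eventually time out. In that case I’d wait for the new content itself instead. A minimal sketch, with both selectors being hypothetical placeholders:
// Click, then wait for the ajax-loaded element to appear instead of waiting for a navigation
await page.click('.load-more-button'); // hypothetical button selector
await page.waitForSelector('.results-row'); // hypothetical selector for the loaded data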
Puppeteer looping through links
Web scraping is all about data collection, so there will often be tables or repeated data that you need to loop through. While you can click through and navigate with puppeteer as you loop, you will lose the browser context of the original page as soon as you navigate away. Here’s an example of a bad way to loop through and open pages.
const links = await page.$$('.entry-title');
// Bad way
// Will throw "Error: Execution context was destroyed, most likely because of a navigation" because link ElementHandle is no longer visible
for (let link of links) {
    await link.click();
}
The best way to do this is to get the urls that you are going to navigate to into an array of strings and then loop through that, like this:
const urls: any[] = [];
for (let link of links) {
    const url = await link.$eval('a', element => element.getAttribute('href'));
    urls.push(url);
}

for (let url of urls) {
    await page.goto(url);
}
Sometimes the website only uses javascript to open pages, so the hrefs don’t actually contain links. You have to get creative in these cases. There will almost always be some way to tell one item from another.
An example is https://www.miamidade.realforeclose.com/index.cfm?zaction=USER&zmethod=CALENDAR. Each auction that you click will navigate to a new page but there is no anchor tag or url associated with the html element.
As I dug further, I could see that the click always navigated to https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=10/02/2019, with AUCTIONDATE being the differentiator between the auctions. Looking at the HTML, I could see that each auction had a dayid attribute that contained the auction date parameter I needed.
With that, I can just loop through the auctions, collect all the dayids into an array, and then loop through that array and open a new page with the proper auction date.
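A rough sketch of that idea is below. The '.auction-item' selector is a made-up placeholder (the real page uses its own markup), and it assumes the dayid value can be dropped straight into the AUCTIONDATE parameter:
// Collect the dayid attribute from every auction element ('.auction-item' is hypothetical)
const dayIds = await page.$$eval('.auction-item', elements =>
    elements.map(element => element.getAttribute('dayid'))
);

// Open each auction page by building its url from the dayid
for (let dayId of dayIds) {
    const auctionUrl = `https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=${dayId}`;
    await page.goto(auctionUrl);
    // scrape whatever you need from the auction page here
}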
Finally, close the browser with await browser.close(). If you don’t do this, the script will hang with the browser still open and ready to go.
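Putting all the pieces together, a minimal end-to-end script might look something like this:
import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    await page.goto('https://javascriptwebscrapingguy.com');

    // Grab something simple to prove the scrape works
    const title = await page.$eval('title', element => element.innerHTML);
    console.log('title', title);

    // Always close the browser so the script can exit
    await browser.close();
})();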
THE END.
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!