mzakzook

Intro to Web Scraping w/ Puppeteer

I was recently challenged to learn how to perform web scraping and automated form-filling using Puppeteer and was very impressed with the simplicity and functionality of its implementation.

Puppeteer allows a user to do several things:

  • Scrape webpages for content, using HTML elements and CSS selectors to target information
  • Take screenshots
  • Create PDFs
  • Export scraped data to CSVs (with the help of a library such as csv-writer)
  • Automate simulated user interactions (click, keyboard input) to test webpage functionality

I will discuss the process of setting up Puppeteer and scraping paginated results of Craigslist listings to export to CSV (I'm using Craigslist because its HTML & CSS are easy to digest, but the logic demonstrated should work for just about any site). For more information about taking screenshots, creating PDFs, and automating user interactions (form-filling is a good place to start) check out the sources at the bottom of this post.
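
As a quick illustration of a few of those other features, here is a minimal sketch that takes a screenshot, renders a PDF, and simulates typing and clicking. The file paths and the '#search-input' / '#search-button' selectors are hypothetical placeholders, not part of the Craigslist example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com', {waitUntil: 'networkidle2'});

  // Capture the full page as an image
  await page.screenshot({path: './example.png', fullPage: true});

  // Render the page as a PDF (PDF generation works in headless mode)
  await page.pdf({path: './example.pdf', format: 'A4'});

  // Simulate user input: type into a field and click a button
  // ('#search-input' and '#search-button' are hypothetical selectors)
  await page.type('#search-input', 'puppeteer');
  await page.click('#search-button');

  await browser.close();
})();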

To get started with Puppeteer you'll want to create a directory with a JS file and install Puppeteer by running yarn add puppeteer.

Next you'll want to add the following to your JS file:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a visible (non-headless) browser and open a new tab
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

  // Navigate to the listings page and wait until the network is mostly idle
  await page.goto('https://sfbay.craigslist.org/d/arts-crafts/search/ara', {waitUntil: 'networkidle2'});

  // Wait for the listing links to appear before doing anything else
  await page.waitForSelector('#sortable-results > ul > li > p > a');

  await browser.close();
})();

We first open an async function and create a new instance of a Puppeteer browser. {headless: false} is an optional parameter that tells your program to open a visible Chromium window so you can watch your program run; you may omit this argument and the browser will simply run in the background. Watching the execution of your program helps with debugging. Next we open a new page in the browser and navigate to a webpage (in this case Craigslist's arts & crafts listings). {waitUntil: 'networkidle2'} tells Puppeteer to consider navigation complete once there have been no more than two network connections for at least 500 ms. Finally we tell Puppeteer to wait until a specific selector is available on the page before resuming. This is especially important for SPAs, which may load HTML only after a specific action is taken.
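
Note that waitForSelector will throw if the selector never appears, and it accepts options such as a timeout in milliseconds. A minimal sketch (placed inside the async function above, using the same selector) of handling a page that might have no results:

// waitForSelector throws if the selector never shows up, so it can be wrapped in try/catch
try {
  await page.waitForSelector('#sortable-results > ul > li > p > a', {timeout: 5000});
  console.log('Listings are ready to scrape');
} catch (err) {
  console.log('No listings found within 5 seconds');
}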

Now we'll run through the process of collecting information and exporting it to a CSV:

const puppeteer = require('puppeteer');
const createCsvWriter = require('csv-writer').createArrayCsvWriter;

(async () => {
  ...
  let listings = [];
  let moreItems = true;
  while (moreItems) {
    // Run this function in the context of the page to collect the current page's results
    const tmp = await page.evaluate(() => {
      const itemLinks = '#sortable-results > ul > li > p > a';
      const itemPrices = '#sortable-results > ul > li > p > span.result-meta > span.result-price';
      const priceList = document.querySelectorAll(itemPrices);
      const itemList = document.querySelectorAll(itemLinks);
      // Each entry becomes [name, URL]
      const itemArr = Array.from(itemList).map((itemLi) => {
        return [itemLi.text, itemLi.href];
      });
      const priceArr = Array.from(priceList).map((pri) => {
        return pri.textContent;
      });
      // Append each price so entries become [name, URL, price]
      for (let i = 0; i < itemArr.length; i++) {
        itemArr[i].push(priceArr[i]);
      }
      return itemArr;
    });
    listings.push(...tmp);
    try {
      // Advance to the next page of results; this throws if there is no 'Next' button
      await page.click('#searchform > div > div.paginator.buttongroup > span.buttons > a.button.next');
      await page.waitForSelector('#sortable-results > ul > li > p > a');
    } catch (error) {
      moreItems = false;
    }
  }

  // Write the collected rows to a CSV file
  const csvWriter = createCsvWriter({
    header: [],
    path: './craigslist1.csv'
  });
  csvWriter.writeRecords(listings)
    .then(() => {
      console.log('...Done');
    });

  await browser.close();
})();

You'll notice one change at the top of our file: I've added a require for csv-writer, which will help us export our results later on. The setup code from the previous snippet is represented by the ellipsis.

Our next line creates an array, listings, to hold our collected data. I then create a variable, moreItems, to indicate whether there are additional pages of results; its default value is true. Next we enter a while loop (for pagination) and create a variable, tmp, that uses Puppeteer's page.evaluate to run code in the context of the page we're visiting. For this CSV, I wanted to export each item's name, URL, and price. I access this information using two query selectors: itemLinks (which matches the elements containing names and URLs) and itemPrices. I collect all of the results on the current page for each query selector, then convert the NodeLists into arrays containing just the information I need. I then combine the two arrays (working under the assumption that both arrays will be the same length). Finally I return the combined array from page.evaluate and spread its contents into listings.

Next I check whether there are additional pages by using Puppeteer's click action to attempt to click the 'Next' button. If the button is found, I wait for the selector needed to gather the results on the subsequent page and go back to the top of the while loop. If a 'Next' button is not found, the click throws, so I set moreItems to false and exit the while loop. Once we have exited the loop we have all of the information we need and are ready to create our CSV. The csv-writer package we required earlier makes this task very easy. Refer to the code provided (just make sure to provide the correct path for where you'd like the CSV to land; if you'd like it in another directory you can do that as well).
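
As a small optional tweak (not part of the original snippet), createArrayCsvWriter also accepts column names in its header array, which gives the generated file a readable header row. A minimal sketch, assuming the [name, URL, price] row shape collected above; the column names here are just illustrative:

const createCsvWriter = require('csv-writer').createArrayCsvWriter;

// Hypothetical variation: naming the columns so the CSV starts with a header row
const csvWriter = createCsvWriter({
  header: ['Name', 'URL', 'Price'],
  path: './craigslist1.csv'
});

// listings is the array of [name, url, price] rows collected above
csvWriter.writeRecords(listings)
  .then(() => console.log('...Done'));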

Once our program has collected all of the targeted data, we are able to close the browser. You should then be able to access the CSV file that was generated by navigating to the path that was specified in your program.

Web scraping seems to be a powerful tool to collect data that may not be available in an API and I look forward to exploring Puppeteer more!

Sources:
