Web scraping is something I never thought I'd do. I'm primarily a UI developer, although my career started as a backend developer. I've never been asked at a job to perform any scraping tasks, and I'd never had a personal project that required me to scrape data until recently. I'll share what I've learned, which is honestly probably just scratching the surface of what you can do with a technology like Puppeteer. In this post I'll walk you through installing and writing a Node.js script that will scrape a weather forecast website. The weather forecast site isn't the personal project I mentioned earlier; it's more of a contrived example of how to get started with web scraping using Node.js and Puppeteer.
This post assumes that you have some knowledge of Node.js, NPM, async/await, and the command line.
This is intended to be a multi-part post/tutorial series. We'll start off slow, and eventually have a project that will net us a 10-day weather forecast in JSON format.
Disclaimer: I'm no expert in Node.js and certainly not a Puppeteer expert, so, if you see anything that could be done in a more proficient manner, or that's just plain wrong, please let me know. Thanks.
Note: I'll be using npm in this post. Feel free to use your preferred package manager if you're comfortable doing so.
Project Setup
The website that we'll be scraping is weather.com. We'll work towards getting the 10-day forecast of Austin, Texas. You can most certainly swap out the city for your preferred city.
Let's go ahead and create a new directory for our project:
mkdir weather-scraper
Now navigate into the directory:
cd weather-scraper
Let's initialize the project (I'll use the -y flag to skip all the questions):
npm init -y
Next, I'll open my favorite editor and create a JavaScript file. I'll name mine scraper.js.
Before we get too far, let's add one line to the package.json file so that we can use import declarations (and, later on, top-level await). Add this line:
"type": "module",
Your package.json should look something like this:
{
  "name": "weather-scraper",
  "type": "module",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^19.6.2"
  }
}
Let's now install Puppeteer.
npm i puppeteer
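Heads up: installing Puppeteer also downloads a recent build of Chromium that the library is guaranteed to work with, so the install may take a little while.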
First Few Lines of Code
First things first, let's import puppeteer on line 1:
import puppeteer from 'puppeteer';
We'll create an async function called scrape and write out the first bit of code. The reason that this function is an async function will become quite clear in just a bit (we basically have to await everything).
async function scrape() {
  // dumpio pipes the browser's stdout and stderr into this process,
  // which can be handy for debugging
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
}
We've created two variables, browser and page. The browser variable is created using Puppeteer's launch method, which has a return type of Promise<Browser>. See this page for more info on the Browser type. The page variable is created using the browser context's newPage method, which returns a Promise<Page>. See this page for more info on the Page class.
The Page class has a method that we'll use to navigate to the weather website we're trying to scrape: the goto method. This method takes in one parameter, the URL, along with an optional options parameter which we won't use at this time. It returns a Promise<HTTPResponse | null>.
We'll add that next:
await page.goto('https://weather.com/weather/tenday/l/Austin+TX');
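As a quick aside, here's a sketch of what that optional options parameter can look like if you ever need it. Both waitUntil and timeout are real options on goto, but we won't use them in this tutorial:

// Wait until the network is mostly idle, and allow up to 60 seconds
await page.goto('https://weather.com/weather/tenday/l/Austin+TX', {
  waitUntil: 'networkidle2',
  timeout: 60000,
});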
Our scraper.js file should look like this at this point:
import puppeteer from 'puppeteer';

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto('https://weather.com/weather/tenday/l/Austin+TX');
}
Let the Scraping Begin
After the goto method, we can use the Page class's evaluate method to work some magic. Inside the evaluate method is where we can write our function inside the page's context and have the result returned. Essentially, we'll write some code inside this method to get the data we want from the page. I'm going to put the code below, then discuss:
async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

  const weatherData = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
      (e) => ({
        date: e.querySelector("h3").innerText,
      })
    )
  );

  await browser.close();
  return weatherData;
}
The evaluate method is a higher-order function, meaning we can pass another function in as a parameter, which is exactly what we're doing. Inside the anonymous function that we pass in, we have access to the document object. If you inspect the weather.com URL that I shared, you should be able to find the DaypartDetails--DayPartDetail--2XOOV class.
We are using the Array.from method and passing in document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"), which will return a NodeList of all elements in the document with that class. The Array.from method has a second, optional parameter, which is a mapping function. We are using that mapping function to select the first h3 element inside each element of the NodeList and assign the value of its innerText to a property that we call date.
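If that second parameter to Array.from is new to you, here's a tiny standalone sketch of the same pattern, with no Puppeteer involved:

// Array.from(iterable, mapFn) converts and maps in a single pass
const days = Array.from(['Tonight', 'Tue 31'], (s) => ({ date: s }));
console.log(days); // [ { date: 'Tonight' }, { date: 'Tue 31' } ]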
Viewing Our Data
After the scrape function, add in these two lines and save:
const scrapedData = await scrape();
console.log(scrapedData);
Let's go to the terminal and run:
node scraper.js
We get results that look something like this:
[
  { date: 'Tonight' },
  { date: 'Tue 31' },
  { date: 'Wed 01' },
  { date: 'Thu 02' },
  { date: 'Fri 03' },
  { date: 'Sat 04' },
  { date: 'Sun 05' },
  { date: 'Mon 06' },
  { date: 'Tue 07' },
  { date: 'Wed 08' },
  { date: 'Thu 09' },
  { date: 'Fri 10' },
  { date: 'Sat 11' },
  { date: 'Sun 12' },
  { date: 'Mon 13' }
]
Very cool.
The entire scraper.js file looks like this:
import puppeteer from "puppeteer";

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

  const weatherData = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
      (e) => ({
        date: e.querySelector("h3").innerText,
      })
    )
  );

  await browser.close();
  return weatherData;
}

const scrapedData = await scrape();
console.log(scrapedData);
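One caveat worth mentioning: if goto or evaluate throws, the await browser.close() line never runs, and a Chromium process gets left behind. A slightly more defensive version of our scrape function (same logic, just wrapped in try/finally) could look something like this:

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  try {
    const page = await browser.newPage();
    await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

    return await page.evaluate(() =>
      Array.from(
        document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
        (e) => ({
          date: e.querySelector("h3").innerText,
        })
      )
    );
  } finally {
    // Runs whether or not something above throws
    await browser.close();
  }
}

The simpler version is fine for this tutorial, but the try/finally habit pays off once a scraper runs unattended.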
Wrapping Up
I implore you to play around with this. See if you're able to scrape other bits of data. In my next post I'll continue where I left off here. By the end of my web scraping posts I'll have shown you how to scrape, create a GitHub Action to automatically scrape, and have the GitHub Action save the data into a .json file in the same repo.
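If you want a nudge in that direction, the pattern is the same as before: find another class name with your browser's dev tools and add another property to the mapped object. Here's a sketch; note that .SomeTemperatureClass is a made-up placeholder, not a real class on the page, so swap in whatever you find in the inspector:

const weatherData = await page.evaluate(() =>
  Array.from(
    document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
    (e) => ({
      date: e.querySelector("h3").innerText,
      // Placeholder selector: replace with the real class from dev tools.
      // Optional chaining keeps this from throwing if nothing matches.
      temperature: e.querySelector(".SomeTemperatureClass")?.innerText ?? null,
    })
  )
);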
I hope you found this post interesting. Thanks for sticking around. Until next time.