Web scraping is something I never thought I'd do. I'm primarily a UI developer, although my career started as a backend developer. I've never been asked at a job to perform any scraping tasks, and I'd never had a personal project that required me to scrape data until recently. I'll share what I've learned, which is honestly probably just scratching the surface of what you can do with a technology like Puppeteer. In this post I'll walk you through installing and writing a Node.js script that will scrape a weather forecast website. The weather forecast site isn't the personal project I mentioned earlier; it's more of a contrived example of how to get started with web scraping using Node.js and Puppeteer.
This post assumes that you have some knowledge of Node.js, NPM, async/await, and the command line.
This is intended to be a multi-part post/tutorial series. We'll start off slow, and eventually have a project that will net us a 10-day weather forecast in JSON format.
Disclaimer: I'm no expert in Node.js and certainly not a Puppeteer expert, so, if you see anything that could be done in a more proficient manner, or that's just plain wrong, please let me know. Thanks.
Note: I'll be using npm in this post. Feel free to use your preferred package manager if you're comfortable doing so.
Project Setup
The website that we'll be scraping is weather.com. We'll work towards getting the 10-day forecast of Austin, Texas. You can most certainly swap out the city for your preferred city.
Let's go ahead and create a new directory for our project:
mkdir weather-scraper
Now navigate into the directory:
cd weather-scraper
Let's initialize the project (I'll use the -y flag to skip all the questions):
npm init -y
Next, I'll open my favorite editor and create a JavaScript file. I'll name mine scraper.js.
Before we get too far, let's add one line to the package.json file so that we can use import declarations (and, later on, top-level await). Add this line:
"type": "module",
Your package.json should look something like this:
{
  "name": "weather-scraper",
  "type": "module",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "puppeteer": "^19.6.2"
  }
}
Let's now install Puppeteer.
npm i puppeteer
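Heads up: installing Puppeteer also downloads a recent build of Chromium that the library is guaranteed to work with, so the install may take a little while.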
First Few Lines of Code
First things first, let's import puppeteer on line 1:
import puppeteer from 'puppeteer';
We'll create an async function called scrape and write out the first bit of code. The reason that this function is an async function will become quite clear in just a bit (we basically have to await everything).
async function scrape() {
  // dumpio pipes the browser's stdout and stderr into this process,
  // which can be handy for debugging
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
}
We've created two variables, browser and page. The browser variable is created using Puppeteer's launch method, which has a return type of Promise<Browser>. See this page for more info on the Browser type. The page variable is created using the browser context's newPage method, which returns a Promise<Page>. See this page for more info on the Page class.
The Page class has a method that we'll use to navigate to the weather website we're trying to scrape: the goto method. This method takes in one parameter, the URL, along with an optional options parameter which we won't use at this time. It returns a Promise<HTTPResponse | null>.
We'll add that next:
await page.goto('https://weather.com/weather/tenday/l/Austin+TX');
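As a quick aside, here's a sketch of what that optional options parameter can look like if you ever need it. Both waitUntil and timeout are real options on goto, but we won't use them in this tutorial:

// Wait until the network is mostly idle, and allow up to 60 seconds
await page.goto('https://weather.com/weather/tenday/l/Austin+TX', {
  waitUntil: 'networkidle2',
  timeout: 60000,
});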
Our scraper.js file should look like this at this point:
import puppeteer from 'puppeteer';

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto('https://weather.com/weather/tenday/l/Austin+TX');
}
Let the Scraping Begin
After the goto method, we can use the Page class's evaluate method to work some magic. Inside the evaluate method is where we can write our function inside the page's context and have the result returned. Essentially, we'll write some code inside this method to get the data we want from the page. I'm going to put the code below, then discuss:
async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

  const weatherData = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
      (e) => ({
        date: e.querySelector("h3").innerText,
      })
    )
  );

  await browser.close();
  return weatherData;
}
The evaluate method is a higher-order function, meaning we can pass another function in as a parameter, which is exactly what we're doing. Inside the anonymous function that we pass in, we have access to the document object. If you inspect the weather.com URL that I shared, you should be able to find the DaypartDetails--DayPartDetail--2XOOV class.
We are using the Array.from method and passing in document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"), which will return a NodeList of all elements in the document with that class. The Array.from method has a second, optional parameter, which is a mapping function. We are using that mapping function to select the first h3 element inside each element of the NodeList and assign the value of its innerText to a property that we call date.
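If that second parameter to Array.from is new to you, here's a tiny standalone sketch of the same pattern, with no Puppeteer involved:

// Array.from(iterable, mapFn) converts and maps in a single pass
const days = Array.from(['Tonight', 'Tue 31'], (s) => ({ date: s }));
console.log(days); // [ { date: 'Tonight' }, { date: 'Tue 31' } ]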
Viewing Our Data
After the scrape function, add in these two lines and save:
const scrapedData = await scrape();
console.log(scrapedData);
Let's go to the terminal and run:
node scraper.js
We get results that look something like this:
[
  { date: 'Tonight' },
  { date: 'Tue 31' },
  { date: 'Wed 01' },
  { date: 'Thu 02' },
  { date: 'Fri 03' },
  { date: 'Sat 04' },
  { date: 'Sun 05' },
  { date: 'Mon 06' },
  { date: 'Tue 07' },
  { date: 'Wed 08' },
  { date: 'Thu 09' },
  { date: 'Fri 10' },
  { date: 'Sat 11' },
  { date: 'Sun 12' },
  { date: 'Mon 13' }
]
Very cool.
The entire scraper.js file looks like this:
import puppeteer from "puppeteer";

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  const page = await browser.newPage();
  await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

  const weatherData = await page.evaluate(() =>
    Array.from(
      document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
      (e) => ({
        date: e.querySelector("h3").innerText,
      })
    )
  );

  await browser.close();
  return weatherData;
}

const scrapedData = await scrape();
console.log(scrapedData);
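One caveat worth mentioning: if goto or evaluate throws, the await browser.close() line never runs, and a Chromium process gets left behind. A slightly more defensive version of our scrape function (same logic, just wrapped in try/finally) could look something like this:

async function scrape() {
  const browser = await puppeteer.launch({ dumpio: true });
  try {
    const page = await browser.newPage();
    await page.goto("https://weather.com/weather/tenday/l/Austin+TX");

    return await page.evaluate(() =>
      Array.from(
        document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
        (e) => ({
          date: e.querySelector("h3").innerText,
        })
      )
    );
  } finally {
    // Runs whether or not something above throws
    await browser.close();
  }
}

The simpler version is fine for this tutorial, but the try/finally habit pays off once a scraper runs unattended.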
Wrapping Up
I implore you to play around with this. See if you're able to scrape other bits of data. In my next post I'll continue where I left off here. By the end of my web scraping posts I'll have shown you how to scrape, create a GitHub Action to automatically scrape, and have the GitHub Action save the data into a .json file in the same repo.
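If you want a nudge in that direction, the pattern is the same as before: find another class name with your browser's dev tools and add another property to the mapped object. Here's a sketch; note that .SomeTemperatureClass is a made-up placeholder, not a real class on the page, so swap in whatever you find in the inspector:

const weatherData = await page.evaluate(() =>
  Array.from(
    document.querySelectorAll(".DaypartDetails--DayPartDetail--2XOOV"),
    (e) => ({
      date: e.querySelector("h3").innerText,
      // Placeholder selector: replace with the real class from dev tools.
      // Optional chaining keeps this from throwing if nothing matches.
      temperature: e.querySelector(".SomeTemperatureClass")?.innerText ?? null,
    })
  )
);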
I hope you found this post interesting. Thanks for sticking around. Until next time.