Note: An updated version of this tutorial can be found here.
“Scraping” can be used to collect and analyse data from sources that don’t have APIs.
In this tutorial we’ll use JavaScript to scrape content from a website that’s rendered server-side.
You’ll need to have Node.js and npm installed if you haven’t already.
Let’s start by creating a project folder and initialising it with a package.json file:
mkdir scraper
cd scraper
npm init -y
We’ll be using two packages to build our scraper script.
- axios – Promise based HTTP client for the browser and node.js.
- cheerio – Implementation of jQuery designed for the server (makes it easy to work with the DOM).
Install the packages by running the following command:
npm install axios cheerio --save
Next create a file called scrape.js and include the packages we just installed:
const axios = require("axios");
const cheerio = require("cheerio");
In this example I’ll be using https://lobste.rs/ as the data source to be scraped.
Inspecting the code, the site name in the header has a cur_url
class, so let’s see if we can scrape its text:
Add the following to scrape.js to fetch the HTML and log the title text if successful:
axios("https://lobste.rs/")
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);
    const title = $(".cur_url").text();
    console.log(title);
  })
  .catch(console.error);
Run the script with the following command and you should see Lobsters logged in the terminal:
node scrape.js
If everything’s working we can proceed to scrape some actual content from the website.
Let’s get the titles, domains and points for each of the stories on the homepage by updating scrape.js:
axios("https://lobste.rs/")
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);
    const storyItem = $(".story");
    const stories = [];
    storyItem.each(function () {
      const title = $(this).find(".u-url").text();
      const domain = $(this).find(".domain").text();
      const points = $(this).find(".score").text();
      stories.push({
        title,
        domain,
        points,
      });
    });
    console.log(stories);
  })
  .catch(console.error);
This code loops through each of the stories, grabs the data, and then stores it in an array called stories.
If you’ve worked with jQuery then the selectors will be familiar; if not, you can learn about them here.
Now re-run node scrape.js and you should see the data for each of the stories: