Intro to web scraping (w/ Node.js example)

#webscraping #javascript #beginners

When it comes to scraping the web, Python is definitely king. Libraries like Scrapy and Beautiful Soup make parsing raw HTML (relatively) simple and can be used to build a basic scraping tool in minutes. Fortunately for JavaScript developers, there are also some pretty cool tools out there that accomplish much of the same. This post provides a brief intro to scraping with Node.js and cheerio. We will also build our very own web scraper to extract image URLs from the website of our choice!

What is web scraping?

According to Internet Live Stats, there are more than 1.7 billion websites on the internet today. It is estimated that Google knows about more than 130 trillion pages (a 2016 estimate, the most recent I could find...). Basically, there is A LOT of data out there. Web scrapers are tools that help us sift through the madness. In their simplest form, they request the HTML from a webpage and quickly sort through it to find a target specified by the programmer. This could be contact information, phone numbers, embedded links -- really anything you could think of that exists in the raw HTML of the response. So you might be thinking, aren't APIs built for sharing data? Yes, but many websites don't have APIs, and even those that do may not want you to have easy access to the organized information that their pages contain. It is up to web scrapers to do the dirty work for us.

Is web scraping legal?

Before we get into actually building a web scraper, it is important to note that some websites are not OK with you scraping them. Companies like Craigslist have even been awarded millions of dollars as the result of legal action taken against other companies that scraped their sites. So it is always a good idea to check out the robots.txt file for a website before you try to scrape it. This can be found by appending /robots.txt to the end of most sites' domain names. Below is what this looks like for Craigslist:

What you need to know here is that it is not OK to make a program (bot) that makes requests to the endpoints this file lists as disallowed. You should also check out the website's terms of use, usually found in the footer or on the about page. So do your homework before you get started. For the example below, we will be making requests to http://books.toscrape.com/, which is a site set up specifically for practicing web scraping.
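To make the robots.txt check a bit more concrete, here is a rough sketch of how a script could pull out the disallowed paths before scraping. Everything specific in it is made up for illustration: the sample file contents (this is not Craigslist's real file) and the helper names disallowedPaths and isAllowed. The parsing is deliberately simplified. It only reads the "User-agent: *" group and treats each Disallow value as a plain path prefix, ignoring wildcards and Allow rules.

```javascript
// Hypothetical helper: collect the Disallow paths that apply to all bots.
// A simplified reading of robots.txt, not a complete parser.
function disallowedPaths(robotsTxt) {
  const paths = [];
  let appliesToAll = false; // are we inside a "User-agent: *" group?
  let lastWasAgent = false; // consecutive User-agent lines share one group
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments and whitespace
    const idx = line.indexOf(':');
    if (!line || idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      appliesToAll = lastWasAgent ? appliesToAll || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      if (field === 'disallow' && appliesToAll && value) paths.push(value);
    }
  }
  return paths;
}

// Hypothetical helper: a path is allowed if no disallowed prefix matches it.
function isAllowed(pathname, disallowed) {
  return !disallowed.some((prefix) => pathname.startsWith(prefix));
}

// Made-up robots.txt contents, for illustration only.
const sampleRobots = [
  'User-agent: *',
  'Disallow: /reply',
  'Disallow: /search/',
  '',
  'User-agent: somebot',
  'Disallow: /',
].join('\n');

const blocked = disallowedPaths(sampleRobots);
console.log(blocked); // [ '/reply', '/search/' ]
console.log(isAllowed('/search/books', blocked)); // false
console.log(isAllowed('/catalogue/page-2.html', blocked)); // true
```

A check like this is no substitute for reading the terms of use, but it is a cheap first step a bot can run on its own.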

Building a simple web scraper

Prerequisites: must have node installed.

Make a new directory with the name of your choice and run:
- npm init
Install the dependencies. We will be using axios to make HTTP requests and cheerio to help us parse the HTML that we get back.
- npm install --save cheerio axios
Create a file for our scraper code:
- touch index.js
Since our scraper is going to be making an HTTP request, we need to be able to wait for a response. Axios returns a promise out of the box, so we can use a .then() in which we will have access to the HTML we want to scrape. Below is the basic setup for our axios request:

const axios = require('axios');
const cheerio = require('cheerio');

axios('http://books.toscrape.com/')
  .then((response) => {
    // our scraping code will go here!
  })
  .catch(() => console.log('something went wrong!'))
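As an optional aside, the same wait-for-the-response idea can also be written with async/await instead of .then(). The sketch below is just one way to do it, and the rest of this post sticks with the .then() version. The fetchHtml name is made up, and passing the HTTP client in as a parameter is my own assumption so the function can be tried out with a fake client and no network connection.

```javascript
// Sketch: the request above rewritten with async/await.
// `client` is any object with a promise-returning get(url) method, such as axios.
async function fetchHtml(client, url) {
  try {
    const response = await client.get(url);
    return response.data; // the HTML string lives on the data property
  } catch (error) {
    console.log('something went wrong!', error.message);
    return null; // signal failure to the caller
  }
}

// Usage with axios (commented out so the sketch runs without a network call):
// const axios = require('axios');
// fetchHtml(axios, 'http://books.toscrape.com/').then((html) => {
//   if (html) console.log(html.length);
// });
```

Logging error.message in the catch also tells you a little more than a bare 'something went wrong!' when a request fails.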

The html string that we want will be stored on the data property of the response from axios. We now want to load this html into the cheerio package that we downloaded earlier. Add the following to our .then() block:

const $ = cheerio.load(response.data);

Cheerio processes the HTML string and lets us select tags, classes, ids, attributes, and tag contents almost exactly as we would in jQuery. Let's log the URI from the src of the first img tag on the books.toscrape.com page. Add the following:

const firstUrl = $('body').find('img').attr('src')
console.log(firstUrl)

Notice that we first select the body tag. The .find() method then selects the img tags found within the body. Finally, .attr() reads the src attribute of the first img tag in that selection. Even for something as simple as a photo URL, it definitely takes a bit of investigation, right?!

Let's see our code in action! In our terminal, run:
- node index.js
Your code may take some time to run. This is because we have to wait for our axios request to complete, and it takes cheerio a little while to parse all that HTML. If you are connected to the internet, you should see a URI for an image printed out in your console. Here's what I got:

While this example is admittedly basic, imagine being able to create a bot that grabs all of the image URIs from a website with dynamic content every day, without you having to lift a finger! We can even have our web scraper find the next page button, giving it the ability to crawl across web pages, even jumping to new ones along the way!

In a perfect world, every website would offer a beautiful, well-documented API with open access granted to anyone who wishes. In the meantime, web scrapers do the trick. Have fun trying them out on your own!

Below is the complete code for the super basic image URI scraper:

const axios = require('axios');
const cheerio = require('cheerio');

axios('http://books.toscrape.com/')
  .then((response) => {
    const $ = cheerio.load(response.data);
    const firstUrl = $('body').find('img').attr('src')
    console.log(firstUrl)

  })
  .catch(() => console.log('something went wrong!'))