Scraping websites with NodeJS

Chris Power. Originally published at browntreelabs.com

I'm currently working on a side project where I want to scrape and store the blog posts on certain pages. For this project I chose to use NodeJS. I have been working more with JavaScript lately, so I figured this would be a fun thing to do with Node instead of Ruby, Python, or whatever.

The tooling

There are two really great tools to use when scraping websites with NodeJS: Axios and Cheerio.

Using these two tools together, we can grab the HTML of a web page, load it into Cheerio (more on this later), and query the elements for the information we need.

Axios

Axios is a promise-based HTTP client for both the browser and NodeJS. It is a well-known package that is used in tons and tons of projects. Most of the React and Ember projects I work on use Axios to make API calls.

We can use axios to get the HTML of a website:

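  import axios from 'axios';

  // Top-level await needs an ES module; otherwise wrap this in an async function.
  const response = await axios.get('https://www.realtor.com/news/real-estate-news/');

☝️ resolves to a response object, and response.data holds the HTML of the URL we requested.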
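
Requests can fail, and by default axios rejects the promise when the server answers with a non-2xx status. Here is a minimal sketch of a wrapper that returns only the HTML and reports failures with the status code. fetchHtml is a hypothetical helper, not part of Axios, and the HTTP function is passed in as a parameter (an assumption made here so the sketch runs without a network call):

```javascript
// Hypothetical helper: `httpGet` is any function shaped like `axios.get`,
// i.e. it resolves to an object whose `data` field is the response body.
async function fetchHtml(url, httpGet) {
  try {
    const response = await httpGet(url);
    return response.data; // the HTML string, not the whole response object
  } catch (error) {
    // axios sets `error.response` when the server answered with a non-2xx
    // status; it is undefined when no response arrived at all.
    const status = error.response ? error.response.status : 'no response';
    throw new Error(`Request to ${url} failed (${status})`);
  }
}
```

With Axios this could be called as fetchHtml(url, (u) => axios.get(u)).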

Cheerio

Cheerio is the most amazing package I had never heard of until now. Essentially, Cheerio gives you jQuery-like queries on the DOM structure of the HTML you load! It's amazing and allows you to do things like this:

  const cheerio = require('cheerio');
  const $ = cheerio.load('<h2 class="title">Hello world</h2>');

  const titleText = $('h2.title').text(); // 'Hello world'

If you're at all familiar with JS development, this should feel very familiar to you.

The final script

With Axios and Cheerio, making our NodeJS scraper is dead simple. We call a URL with axios, and load the output HTML into cheerio. Once our HTML is loaded into cheerio, we can query the DOM for whatever information we want!

import axios from 'axios';
// cheerio 1.x has no default export, so import the namespace instead
import * as cheerio from 'cheerio';

export async function scrapeRealtor() {
  const response = await axios.get('https://www.realtor.com/news/real-estate-news/');
  // cheerio.load is synchronous, so no await is needed
  const $ = cheerio.load(response.data);
  const data = [];

  $('.site-main article').each((i, elem) => {
    // keep only the first four articles
    if (i <= 3) {
      data.push({
        image: $(elem).find('img.wp-post-image').attr('src'),
        title: $(elem).find('h2.entry-title').text(),
        excerpt: $(elem).find('p.hide_xxs').text().trim(),
        link: $(elem).find('h2.entry-title a').attr('href')
      });
    }
  });

  console.log(data);
}
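
The selectors above read href and src exactly as they appear in the markup. On this page they happen to be absolute, but a site that uses relative links would need them resolved against the page URL. Here is a minimal sketch using Node's built-in URL class. toAbsoluteUrl is a hypothetical helper name and '/news/some-post/' is a made-up path for illustration:

```javascript
// Hypothetical helper: resolve a scraped href/src against the page it came from.
function toAbsoluteUrl(href, pageUrl) {
  if (!href) return undefined; // .attr() yields undefined when nothing matched
  try {
    return new URL(href, pageUrl).href;
  } catch (error) {
    return undefined; // the value could not be parsed as a URL
  }
}

const page = 'https://www.realtor.com/news/real-estate-news/';
console.log(toAbsoluteUrl('/news/some-post/', page));
// https://www.realtor.com/news/some-post/
```

Already-absolute URLs pass through unchanged, so the helper could be applied to every link and image value without checking first.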

The output

We now have our scraped information!

[ { image:
     'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/iStock-172488314-832x468.jpg',
    title:
     'One-Third of Mortgage Borrowers Are Missing This Opportunity to Save $2,000',
    excerpt:
     'Consumer advocates have an important recommendation for first-time buyers to take advantage of an opportunity to save on housing costs.',
    link:
     'https://www.realtor.com/news/real-estate-news/one-third-of-mortgage-borrowers-are-missing-this-opportunity-to-save-2000/' },
  { image:
     'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/iStock-165493611-832x468.jpg',
    title:
     'Trump Administration Reducing the Size of Loans People Can Get Through FHA Cash-Out Refinancing',
    excerpt:
     'Cash-out refinances have grown in popularity in recent years in tandem with ballooning home values across much of the country.',
    link:
     'https://www.realtor.com/news/real-estate-news/trump-administration-reducing-the-size-of-loans-people-can-get-through-fha-cash-out-refinancing/' },
  { image:
     'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/08/GettyImages-450777069-832x468.jpg',
    title: 'Mortgage Rates Steady as Fed Weighs Further Cuts',
    excerpt:
     'Mortgage rates stayed steady a day after the Federal Reserve made its first interest-rate reduction in a decade, and as it considers more.',
    link:
     'https://www.realtor.com/news/real-estate-news/mortgage-rates-steady-as-fed-weighs-further-cuts/' },
  { image:
     'https://rdcnewsadvice.wpengine.com/wp-content/uploads/2019/07/GettyImages-474822391-832x468.jpg',
    title: 'Mortgage Rates Were Falling Before Fed Signaled Rate Cut',
    excerpt:
     'The Federal Reserve is prepared to cut interest rates this week for the first time since 2008, but the biggest source of debt for U.S. consumers—mortgages—has been getting cheaper since late last year.',
    link:
     'https://www.realtor.com/news/real-estate-news/mortgage-rates-were-falling-before-fed-signaled-rate-cut/' } ]
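
Cheerio does not complain when a selector matches nothing: .attr() returns undefined and .text() returns an empty string. If the page markup changes, the scraper would keep running and quietly produce empty fields. A small check before storing the records can surface that. Here is a sketch, where findIncomplete, REQUIRED_FIELDS, and the sample records are illustrative names and data rather than anything from the site:

```javascript
// Hypothetical helper: list records that have an empty or missing field.
const REQUIRED_FIELDS = ['image', 'title', 'excerpt', 'link'];

function findIncomplete(records, fields = REQUIRED_FIELDS) {
  return records
    .map((record, index) => ({
      index,
      // undefined and '' are both falsy, which covers .attr() and .text() misses
      missing: fields.filter((field) => !record[field]),
    }))
    .filter((entry) => entry.missing.length > 0);
}

const sample = [
  { image: 'a.jpg', title: 'A post', excerpt: 'Some text', link: '/a/' },
  { image: undefined, title: '', excerpt: 'Some text', link: '/b/' },
];
console.log(findIncomplete(sample));
// [ { index: 1, missing: [ 'image', 'title' ] } ]
```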

Discussion

Please consider github.com/gajus/surgeon the next time you are scraping content. The above example could be rewritten as:

import axios from 'axios';
import surgeon, {
  subroutineAliasPreset
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});

export async function scrapeRealtor() {
  const html = await axios.get('https://www.realtor.com/news/real-estate-news/');

  const data = x([
    'sm .site-main article',
    {
      image: 'so img.wp-post-image | ra src',
      title: 'so h2.entry-title | rdtc',
      excerpt: 'so p.hide_xxs | rdtc',
      link: 'so h2.entry-title a | ra href'
    }
  ], html.data); // pass the response body, not the whole axios response

  console.log(data);
}

Apart from being shorter, each selector is also an assertion: if the remote document changes, Surgeon will report which selectors are returning unexpected results.