Discussion on: Scraping websites with NodeJS

Please consider github.com/gajus/surgeon the next time you are scraping content. The above example could be rewritten as:

import axios from 'axios';
import surgeon, {
  subroutineAliasPreset
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});

export async function scrapeRealtor() {
  // axios resolves with a response object; the HTML string is in its `data` property.
  const { data: html } = await axios.get('https://www.realtor.com/news/real-estate-news/');

  const data = x([
    'sm .site-main article',
    {
      image: 'so img.wp-post-image | ra src',
      title: 'so h2.entry-title | rdtc',
      excerpt: 'so p.hide_xxs | rdtc',
      link: 'so h2.entry-title a | ra href'
    }
  ], html);

  console.log(data);
}

Apart from being shorter, this version treats each selector as an assertion: if the remote document changes, Surgeon reports which selectors are returning unexpected results.
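
To make the "selector as assertion" idea concrete, here is a rough sketch in plain JavaScript. It is not Surgeon's implementation or API: `selectOne`, `selectMany` and `UnexpectedResultCountError` are hypothetical names of my own, and `matches` stands in for whatever array of nodes a DOM query would return. It only shows the principle that a selector matching the wrong number of nodes should fail loudly and name itself.

```javascript
// Hypothetical error type (not Surgeon's): carries the offending selector.
class UnexpectedResultCountError extends Error {
  constructor(selector, count, expected) {
    super(`Selector "${selector}" matched ${count} node(s); expected ${expected}.`);
    this.name = 'UnexpectedResultCountError';
    this.selector = selector;
  }
}

// "Select one" semantics: exactly one match, otherwise throw.
function selectOne(selector, matches) {
  if (matches.length !== 1) {
    throw new UnexpectedResultCountError(selector, matches.length, 'exactly 1');
  }
  return matches[0];
}

// "Select many" semantics: at least one match, otherwise throw.
function selectMany(selector, matches) {
  if (matches.length < 1) {
    throw new UnexpectedResultCountError(selector, matches.length, 'at least 1');
  }
  return matches;
}

// Simulate a page redesign: the title selector no longer matches anything.
try {
  selectOne('h2.entry-title', []);
} catch (error) {
  console.error(error.message);
}
```

With hand-rolled scraping you would typically get `undefined` or an empty string in this situation and only notice much later; failing at the selector is what makes a layout change easy to track down.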