DEV Community

Pierre


Simple guide to Web Scraping with NodeJS

For one of my projects (rocketcrew.space, a job board dedicated to the space industry), I have to collect job offers from the career pages of different companies. Every site is different, but each one can be scraped with one of three methods.

1 - Look for an API

The first thing to check when you want to scrape a website is the Network tab of the browser dev tools.
Press F12 and open the "Network" tab. You will see every request the site makes.
If you're lucky, you will spot an API call that the website uses to load its content, such as the job offers on a career page.
In that case, all you have to do is send the same API request yourself to get the content. You can use the Axios library, for example.
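
As a rough sketch of this first method, the function below replays an API call with an Axios-style client and keeps only the fields a job board needs. The endpoint URL, the `page` query parameter, and the response shape (`{ jobs: [{ title, location, url }] }`) are assumptions made for illustration; copy the real ones from the request you find in the Network tab.

```javascript
// Sketch: replay an API request spotted in the Network tab.
// `client` is any object with an Axios-like get(url, config) method,
// for example the one returned by require('axios').
// The URL, the `page` parameter, and the response fields are hypothetical.
async function fetchJobOffers(client, apiUrl, page = 1) {
  const response = await client.get(apiUrl, {
    params: { page }, // sent as the query string ?page=1
    headers: { Accept: 'application/json' },
  });

  // Axios exposes the parsed JSON body as response.data
  const jobs = Array.isArray(response.data?.jobs) ? response.data.jobs : [];

  // Keep only the fields we care about
  return jobs.map(({ title, location, url }) => ({ title, location, url }));
}

// Usage, inside an async function (needs `npm install axios`):
// const axios = require('axios');
// const offers = await fetchJobOffers(axios, 'https://website-url.com/api/jobs');
```

Passing the client in as an argument keeps the function easy to try out with a fake client before pointing it at a real site.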

2 - Server-Side Rendered websites

Some websites are SSR, short for Server-Side Rendered. This means the whole HTML page is generated on the backend, so what we want to scrape can be found directly in the HTML. We just have to parse it.

To do this, you can use Axios to fetch the HTML page and Cheerio to parse it.
Cheerio lets you query the HTML with the same syntax as jQuery.

Here is a simple example that reads the HTML of the element with the id "description".

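// ES module (.mjs file), so top-level await is allowed
import axios from 'axios';
import * as cheerio from 'cheerio';

// Download the page; response.data holds the HTML string
const response = await axios.get('https://website-url.com');
const $ = cheerio.load(response.data);
const description = $('#description').html(); // inner HTML of #description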

3 - Client-Side Rendered websites

The last type of site you can encounter is a SPA (Single Page Application). Here, the server only sends a basic HTML file, and the rest of the site is generated client-side with JavaScript.
We cannot use the previous method because the GET request would only return that basic HTML file, without the content.
To scrape this kind of site, we have to run a headless browser on the backend so that JavaScript can generate the content.
With NodeJS, we can use Puppeteer, which lets us launch and control a Chrome browser.
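
One way to tell the two cases apart before launching a browser is to check whether a piece of text you can see on the rendered page already appears in the raw HTML. The sketch below is only a heuristic and assumes you already have the HTML string from a plain GET request; the function name and the sample markup are made up for illustration.

```javascript
// Returns true when `expectedText` (something visible on the rendered page)
// is already present in the HTML returned by a plain GET request.
// If it is missing, the content is probably generated client-side.
function isInRawHtml(html, expectedText) {
  // Drop <script> and <style> bodies, then all remaining tags
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ');
  return text.toLowerCase().includes(expectedText.toLowerCase());
}

// Hypothetical responses from two different career pages
const ssrHtml = '<html><body><h1>Propulsion Engineer</h1></body></html>';
const spaHtml = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';

console.log(isInRawHtml(ssrHtml, 'Propulsion Engineer')); // true
console.log(isInRawHtml(spaHtml, 'Propulsion Engineer')); // false
```

If the check returns true, Axios and Cheerio are enough. If it returns false, the page most likely needs a real browser to render.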

Here is a short example that uses Puppeteer to get the text of a page's h1.

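// ES module (.mjs file), so top-level await is allowed
import puppeteer from 'puppeteer';

// Start a headless Chrome instance and open a new tab
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://website-url.com");

// The callback runs inside the page and returns the h1 text
const pageTitle = await page.evaluate(() => document.querySelector("h1").textContent);

await browser.close();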
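
Content rendered by JavaScript may not exist yet when `page.goto` resolves, so it can help to wait for the element before reading it. This is a minimal sketch, assuming `browser` comes from `puppeteer.launch()`; the helper name, the default selector, and the 10 second timeout are illustrative choices, not requirements.

```javascript
// Sketch: read the text of an element once client-side rendering has produced it.
// `browser` is assumed to be a Puppeteer Browser from puppeteer.launch().
async function getRenderedText(browser, url, selector = 'h1') {
  const page = await browser.newPage();
  try {
    // Wait until network activity has mostly settled
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Throws if the selector does not appear within 10 seconds
    await page.waitForSelector(selector, { timeout: 10000 });
    // $eval runs the callback in the page on the first matching element
    return await page.$eval(selector, (el) => el.textContent.trim());
  } finally {
    // Close the tab even if navigation or the selector wait failed
    await page.close();
  }
}

// Usage, inside an async function (needs `npm install puppeteer`):
// const browser = await puppeteer.launch();
// const title = await getRenderedText(browser, 'https://website-url.com');
// await browser.close();
```

Taking the browser as an argument lets you reuse one Chrome instance for many pages instead of launching a new one for each URL.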

Let me know if you have any questions!

Follow me on Twitter if you want to learn how I am building RocketCrew!
https://twitter.com/siglavesc2
