DEV Community

Francesco Napoletano


How to scrape that web page with Node.js and puppeteer

If you're like me, sometimes you want to scrape a web page really badly. You probably want the data in a readable format, or just need a way to re-crunch it for other purposes.

I solemnly swear that I am up to no good.

I've found my optimal setup after many tries with Guzzle, BeautifulSoup, etc... Here it is:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

What does it mean? It means you can run a Chrome instance and put it at your service. Cool, isn't it?

Let's see how to do it.


Yes, the usual setup. Fire up your terminal, create a folder for your project and run npm init inside it.

Once you're set up you'll have a package.json file. We're good to go. Now run npm i -S puppeteer to install Puppeteer.

A little warning: Puppeteer will download a full version of Chromium into your node_modules folder.

Don't worry: since version 1.7.0 Google publishes the puppeteer-core package, a version of Puppeteer that doesn't download Chromium by default.

So, if you're willing to try it, just run npm i -S puppeteer-core instead.

puppeteer-core is intended to be a lightweight version of puppeteer for launching an existing browser installation or for connecting to a remote one.
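With puppeteer-core you have to point the library at a browser that's already on your machine. A minimal sketch; the executablePath below is an assumption, so adjust it to wherever Chrome lives on your system:

```javascript
const puppeteer = require('puppeteer-core');

(async () => {
  // No bundled Chromium here: we must supply the path to an existing
  // Chrome/Chromium install (this path is just an example)
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/google-chrome',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();
```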

Ok, we're good to go now.

Your first scraper

Touch an index.js file in the project folder and paste this code in it.
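The snippet was originally embedded as a gist, so here is a sketch reconstructing it from the line-by-line walkthrough below. Treat the details as assumptions: the blog URL, the article h2 a selector, the user-agent string and the jQuery CDN link are mine, and the line numbers won't match the walkthrough exactly.

```javascript
const puppeteer = require('puppeteer');
const url = 'https://www.napolux.com'; // the page we're going to scrape

puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage();
  // Mobile viewport + fake user-agent: look like a good old iPhone
  await page.setViewport({ width: 375, height: 667 });
  await page.setUserAgent(
    'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) ' +
    'AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
  );
  await page.goto(url);
  // Wait for a selector so we know the page has actually loaded
  await page.waitForSelector('article');
  // Inject jQuery so we can use its CSS selectors inside evaluate()
  await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.3.1.min.js' });

  // Run code inside the page and pull out titles and URLs
  const result = await page.evaluate(() => {
    return new Promise((resolve, reject) => {
      const data = [];
      $('article h2 a').each(function () {
        data.push({
          title: $(this).text().trim(),
          url: $(this).attr('href'),
        });
      });
      resolve(data);
    });
  });

  await browser.close();

  for (let i = 0; i < result.length; i++) {
    console.log('Post: ' + result[i].title + ' URL: ' + result[i].url);
  }
});
```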

That's all you need to set up a web scraper. You can also find it in my repo.

Let's dig into the code a bit.

For the sake of our example we'll just grab all the post titles and URLs from my blog homepage. To add a nice touch we'll change our user-agent in order to look like a good old iPhone while browsing the webpage we're scraping.

And because we're lazy, we'll inject jQuery into the page in order to use its wonderful CSS selectors.

So... Let's go line by line:

  • Lines 1-2: we require Puppeteer and configure the website we're going to scrape
  • Line 4: we launch Puppeteer. Please remember we're in the kingdom of Lord Asynchronous, so everything is a Promise, is async, or has to wait for something else ;) As you can see, the config is self-explanatory: we're telling the script to run Chromium headless (no UI).
  • Lines 5-10: the browser is up; we create a new page, set the viewport to a mobile screen size, set a fake user-agent and open the webpage we want to scrape. To be sure the page is loaded, we wait for a selector to appear.
  • Line 11: as I said, we inject jQuery into the page
  • Lines 13-28: here is where the magic happens: we evaluate the page and run some jQuery code to extract the data we need. Nothing fancy, if you ask me.
  • Lines 31-37: we're done: we close the browser and print out our data:

Run node index.js from the project folder and you should end up with something like...

Post: Blah blah 1? URL:
Post: Blah blah 2? URL:
Post: Blah blah 3? URL:


So, welcome to the world of web scraping. It was easier than expected, right? Just remember that web scraping is a controversial matter: please scrape only websites you're authorized to scrape.

No. As the owner of this blog, I don't authorize you.

I leave it to you to figure out how to scrape AJAX-based webpages ;)
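If you want a head start, one common approach (a sketch of mine, with a hypothetical /api/posts endpoint) is to wait for the XHR response that carries the data instead of waiting for a DOM selector:

```javascript
const puppeteer = require('puppeteer');

puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage();
  // Start listening for the AJAX response before navigating,
  // so we can't miss it ('/api/posts' is a made-up endpoint)
  const dataResponse = page.waitForResponse(
    res => res.url().includes('/api/posts')
  );
  await page.goto('https://example.com/blog');
  const posts = await (await dataResponse).json();
  console.log(posts);
  await browser.close();
});
```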

Originally published @

Top comments (6)

g0n_freecs

This is a great, concisely explained article. I tried using the whole block from lines 13-28, and I keep getting this error:

> (node:65901) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: reject is not defined

How could I resolve this error?

napolux
Francesco Napoletano

Well, the puppeteer.launch().then(async browser => { ... }) chain is a promise itself, so the reject is there.

Just tried the code and it still works.

usbinternet

Francesco Napoletano,
Your code is great!!!

But I can't save the data to a .txt file: it reports an Undefined error. Help me fix it. Why use:

for (var i = 0; i < result.length; i++) {
  console.log('Post: ' + result[i].title + ' URL: ' + result[i].url);
}

I can't export the value, it just seems to print to the screen.
When I export it to a .txt file, an Undefined error appears. Please help me export the .txt file.

Thanks!!!

NO ERROR BUT devnew Undefined !!!
var devnew = result.title;


qm3ster
Mihail Malo

Title says scrap instead of scrape

menjilx

How do I save the result to a MySQL database?

redrosh

Hi, if you want to export the result to a MySQL db, you can install a MySQL client library such as mysql2 and use it to connect to your db, then run an INSERT from your script to export the data.
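A minimal sketch of that idea, assuming the mysql2 package (npm i mysql2) and a posts table with title and url columns; the connection settings are placeholders:

```javascript
const mysql = require('mysql2/promise');

async function savePosts(result) {
  // Connection settings are placeholders; use your own credentials
  const connection = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: 'secret',
    database: 'scraper',
  });
  // One parameterized INSERT per scraped post
  for (const post of result) {
    await connection.execute(
      'INSERT INTO posts (title, url) VALUES (?, ?)',
      [post.title, post.url]
    );
  }
  await connection.end();
}
```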