Solomon Eseme

Posted on Nov 20, 2020 • Originally published at Medium on Nov 17, 2020

Web Scraping with Nuxtjs using Puppeteer

#laravel #vue #nuxt #webdev

This article demonstrates how to set up and configure Puppeteer to work properly with Nuxtjs, and how to use it to scrape a job listing website and display the results on your own site.

Since Puppeteer is a server-side Node package, it is difficult to set up alongside a client-side library like Vue.js, and there are not many tutorials online that show how to do it properly.

This article shows how I solved the problem in a client’s project using Nuxt.js and Puppeteer.

Web Scraping Fundamentals

Web scraping can sound strange at first, but it is a simple term to understand.

Web scraping is a technique for extracting data from websites and saving it in any desired format for further processing.

It automates the manual process of extracting information from websites and storing that information electronically.
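As a toy illustration of "extract, then save in a desired format", the sketch below pulls product names and prices out of a fixed HTML fragment and serializes them as JSON. The fragment, the class names, and the extractProducts helper are all made up for this example. A regular expression only works here because the markup is fixed; real pages need a proper DOM, which is what Puppeteer provides.

```javascript
// Hypothetical HTML fragment standing in for a downloaded page.
const html = `
  <li class="product"><span class="name">Keyboard</span><span class="price">$45.00</span></li>
  <li class="product"><span class="name">Mouse</span><span class="price">$19.50</span></li>
`

// Toy extractor: matches each name/price pair in the fixed fragment above.
function extractProducts(source) {
  const pattern = /<span class="name">([^<]*)<\/span><span class="price">\$([\d.]+)<\/span>/g
  return Array.from(source.matchAll(pattern), ([, name, price]) => ({
    name,
    price: Number(price),
  }))
}

const products = extractProducts(html)

// Save in a desired format: JSON here, but CSV or a database would work too.
const asJson = JSON.stringify(products, null, 2)
console.log(asJson)
```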

Usage

Extracting product details from e-commerce websites, such as prices, product names, and images.
Gathering structured data from multiple websites for research.
Automating the collection of data from different sources for analysis.
Collecting data for testing and training machine learning models.

Methods of web scraping

Web scraping software: This is the most popular approach, where pre-made software is deployed to do the scraping.
Writing code: A developer writes scraping scripts for a specific website, based on the owner's requirements.

Introduction to Puppeteer

Puppeteer is a Node library used to scrape web pages, automate form submissions, and more.

It is Google’s official Node library for controlling a Google Chrome instance from Node.js. It can also be configured to run in headless mode, so that Chrome runs in the background.

Puppeteer can be used for several use cases but I will only list a few below.

Scraping web pages.
Tracking page load performance.
Automating form submissions.
Generating page screenshots.
Generating PDFs of website pages.
Automated testing.
General browser automation.
Pre-rendering a single-page app on the server for preview.
etc.

Building a JobScrapper Application with Nuxtjs using Puppeteer

Creating a new Nuxt Project

Before we start developing our web scraper, we need to install and set up Nuxtjs. Following the simple steps in the official documentation will speed up the process.

Type in the following command to set up the project and accept the default setup options.

yarn create nuxt-app <project-name>

After installation, let’s start by creating the different components, stores, and pages that will be needed in this project.

Create a component called Jobs to display a list of all the scraped jobs.

cd components 

touch Jobs.vue

Next, create a new job store in the store folder to manage our jobs state.

cd store 

touch job.js

Lastly, let's create a jobs page inside the pages folder so we can navigate to it if needed.

cd pages 

touch jobs.vue

Of course, this is a minimal setup; a real project can be more complex and contain many components, pages, and stores to manage different states.

Installing dependencies.

Next, install all the dependencies needed to scrape pages with Nuxtjs and Puppeteer.

npm i puppeteer net tls

This command installs the Puppeteer library and its supporting libraries.

Configuring Puppeteer

This is the difficult part. I ran into several issues configuring Puppeteer to work with Nuxtjs, because Nuxtjs is both a client-side and a server-side framework.

It is hard to know where to place Puppeteer or how to call it, since Puppeteer is a server-side Node library and only works on the server side of Nuxtjs.

I will go ahead and explain how I got it working in my project.

First, let’s create a new script.js file in the root directory and paste in the following code.

const saveFile = require('fs').writeFileSync 

const pkgJsonPath = require.main.paths[0] + '/puppeteer' + '/package.json' 

// console.log(pkgJsonPath) 
const json = require(pkgJsonPath) 

// eslint-disable-next-line no-prototype-builtins 
if (!json.hasOwnProperty('browser')) { json.browser = {} } 

delete json.browser.ws 

saveFile(pkgJsonPath, JSON.stringify(json, null, 2))

Looking at the script, you might already understand what it does. If not, here is an explanation.

It opens the node_modules/puppeteer/package.json file and deletes a particular line.

Before deleting that line, it checks whether the package.json has a browser object and creates an empty one if not. It then deletes the ws property of the browser object and saves the file.

Once it is registered as a postinstall script (the next step), the script will run each time we run npm install.

The ws entry is Puppeteer's web socket mapping, which points to a web socket module that does not exist in our project.

Because that line is deleted each time we run npm install, Puppeteer defaults to using the web socket package in our node_modules folder.
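The effect of the script on the parsed package.json can be sketched as a pure function. The removeBrowserWs name and the sample ws path are hypothetical, made up for illustration; the real script edits the file in place rather than returning a copy.

```javascript
// Pure version of the patch step: takes the parsed package.json object
// and returns a copy with the browser.ws mapping removed.
function removeBrowserWs(pkg) {
  // Create the browser object if it is missing, as the script does.
  const browser = { ...(pkg.browser || {}) }
  delete browser.ws
  return { ...pkg, browser }
}

// Hypothetical package.json contents, for illustration only.
const before = { name: 'puppeteer', browser: { ws: './some/browser/shim.js' } }
const after = removeBrowserWs(before)

console.log(after.browser)
```

Working on a copy makes the step easy to test without touching node_modules.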

Now, let’s add the script to our package.json file where it will be executed as a postinstall script.

Open your package.json file and add the following code.

....... 

"scripts": { 
  "dev": "nuxt", 
  "build": "nuxt build", 
  "start": "nuxt start", 
  "export": "nuxt export", 
  "serve": "nuxt serve", 
  "lint:js": "eslint --ext .js,.vue --ignore-path .gitignore .",
  "lint": "yarn lint:js", 
  "test": "jest", 
  "postinstall": "node script" 
}, 

....

You also need to add the following code into your package.json file.

....... 

"browser": { 
   "fs": false, 
   "path": false, 
   "os": false, 
   "tls": false 
} 

.......

That just sets fs, path, os and tls to false, because these modules are only needed on the server side.

Now that the hard part is done, let’s configure Webpack to deal with Puppeteer correctly.

Open your nuxt.config.js file and add the following extend function to the build object.

build: {
  extend(config, { isServer, isClient }) {
    config.externals = config.externals || {}
    if (!isServer) {
      config.node = {
        fs: 'empty',
      }
      if (Array.isArray(config.externals)) {
        config.externals.push({
          puppeteer: require('puppeteer'),
        })
      } else {
        config.externals.puppeteer = require('puppeteer')
      }
    }
    config.output.globalObject = 'this'
    return config
  },
},

This configuration requires puppeteer and adds it to the externals only for the client-side build, where it also sets fs to empty.

If you did everything right, Puppeteer should now be ready to use with Nuxtjs to scrape pages. If you’re stuck, you can grab the repository here.
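To see what that branching does, here is a small sketch that applies the same logic to mock Webpack config objects. The addPuppeteerExternal function and the fakePuppeteer object are stand-ins invented for this example, so that it runs without Nuxt or Puppeteer installed.

```javascript
// Stand-in for the real module so the sketch runs without Puppeteer.
const fakePuppeteer = { launch() {} }

// Same branching as the extend() function, with the module passed in.
function addPuppeteerExternal(config, isServer, puppeteerModule) {
  config.externals = config.externals || {}
  if (!isServer) {
    config.node = { fs: 'empty' }
    if (Array.isArray(config.externals)) {
      config.externals.push({ puppeteer: puppeteerModule })
    } else {
      config.externals.puppeteer = puppeteerModule
    }
  }
  return config
}

// Client build: puppeteer is added to externals and fs is stubbed out.
const clientConfig = addPuppeteerExternal({ externals: [] }, false, fakePuppeteer)
// Server build: nothing is added.
const serverConfig = addPuppeteerExternal({}, true, fakePuppeteer)

console.log(clientConfig.externals.length, clientConfig.node)
console.log(serverConfig.externals, serverConfig.node)
```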

Now we can move to the easy part.

Web Scraping

Create a file called JobScrapper.js and paste in the following code.

In my project, I was given a list of websites to scrape, so that I would not violate any scraping rules (just saying 🙂).

const puppeteer = require('puppeteer')
const jobUrl = '' // SITE URL HERE
let page
let browser
let cardArr = []

class Jobs {
  // We will add 3 methods here

  // Initializes and creates the puppeteer instance
  static async init() {}

  // Visits the page, retrieves the jobs
  static async resolver() {}

  // Closes the browser and returns the jobs with a total count
  static async getJobs() {}
}

export default Jobs

Create the Init method

static async init() {
  browser = await puppeteer.launch({
    // headless: false,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process', // <- this one doesn't work on Windows
      '--disable-gpu',
    ],
  })

  page = await browser.newPage()
  // Note: because each entry is awaited, the two steps run one after the
  // other (navigate first, then wait for the job cards). Errors are
  // swallowed, so a timeout does not throw here.
  await Promise.race([
    await page.goto(jobUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
    await page.waitForSelector('.search-card').catch(() => {}),
  ])
}

The init function launches Puppeteer with several configuration flags, creates a new page with browser.newPage(), visits our URL with await page.goto(.........), and waits for the job cards to appear with await page.waitForSelector(.....).
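Note that in the init method each entry passed to Promise.race is itself awaited, so the two steps run in order rather than racing. The sketch below shows the difference, using timers as stand-ins for page.goto and page.waitForSelector. The delay helper, the labels, and the timings are assumptions made for this illustration.

```javascript
// Resolves with `value` after `ms` milliseconds.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms))

// Both timers start immediately; the faster one decides the result.
async function raced() {
  return Promise.race([delay(50, 'navigation'), delay(10, 'selector')])
}

// With `await` inside the array, each timer finishes before the next starts,
// so Promise.race receives two plain values and settles with the first entry.
async function sequential() {
  return Promise.race([await delay(50, 'navigation'), await delay(10, 'selector')])
}

raced().then((winner) => console.log('raced:', winner))
sequential().then((winner) => console.log('sequential:', winner))
```

For this scraper the sequential behaviour is the useful one: the selector wait only makes sense once navigation has finished.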

Create a Resolver method.

// Visits the page, retrieves the jobs
static async resolver() {
  await this.init()

  // The callback runs inside the browser page, not in Node,
  // so it can only use variables declared within it.
  const jobURLs = await page.evaluate(() => {
    const cards = document.querySelectorAll('.search-card')
    const cardArr = Array.from(cards)
    const cardLinks = []

    cardArr.forEach((card) => {
      const cardTitle = card.querySelector('.card-title-link')
      const cardDesc = card.querySelector('.card-description')
      const cardCompany = card.querySelector('a[data-cy="search-result-company-name"]')
      const cardDate = card.querySelector('.posted-date')

      // The title link is an anchor element, so its URL parts can be read directly
      const { text, host, protocol } = cardTitle
      const pathName = cardTitle.pathname
      const query = cardTitle.search
      const titleURL = protocol + '//' + host + pathName + query
      const company = cardCompany.textContent

      cardLinks.push({
        titleText: text,
        titleURLHost: host,
        titleURLPathname: pathName,
        titleURLSearchQuery: query,
        titleURL: titleURL,
        titleDesc: cardDesc.innerHTML,
        titleCompany: company,
        titleDate: cardDate.textContent,
      })
    })

    return cardLinks
  })

  return jobURLs
}

This method does most of the work.

It selects all the listed jobs, converts them to a JavaScript array, and loops through each one while retrieving the data needed.
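The resolver assumes every card contains all four elements. If a card is missing one of them, querySelector returns null and reading a property from it throws. A more defensive mapping could look like the sketch below. The toJob and textOf helpers are hypothetical names, and plain objects stand in for DOM elements so the example runs in Node.

```javascript
// Returns the trimmed text of an element, or '' when the selector matched nothing.
const textOf = (el) => (el && el.textContent ? el.textContent.trim() : '')

// Builds one job record from the elements found inside a card.
// `link` mimics an anchor element, which exposes protocol, host, pathname and search.
function toJob({ link, desc, company, date }) {
  if (!link) return null // a card without a title link is not usable
  return {
    titleText: textOf(link),
    titleURL: link.protocol + '//' + link.host + link.pathname + link.search,
    titleDesc: desc ? desc.innerHTML : '',
    titleCompany: textOf(company),
    titleDate: textOf(date),
  }
}

// Hypothetical card that has no company element.
const sample = toJob({
  link: {
    textContent: ' Vue Developer ',
    protocol: 'https:',
    host: 'example.com',
    pathname: '/jobs/1',
    search: '?ref=a',
  },
  desc: { innerHTML: '<p>Remote</p>' },
  company: null,
  date: { textContent: '2 days ago' },
})

console.log(sample)
```

Inside page.evaluate, cards for which the helper returns null could simply be skipped.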

Create a getJobs method

static async getJobs() { 
    const jobs = await this.resolver() 
    await browser.close() 
    const data = {} 
    data.jobs = jobs 
    data.total_jobs = jobs.length 
    return data 
}

The method closes the browser and returns the jobs from the resolver method together with their total count.
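One thing to watch for: if the resolver throws, browser.close() is never reached and the Chrome process keeps running. A try/finally wrapper avoids that. The sketch below is one possible shape, not the project's code. The withBrowser helper is a hypothetical name, and the launch function is injected so the example runs with a fake browser instead of Puppeteer.

```javascript
// Runs `work(browser)` and closes the browser afterwards, even if `work` throws.
// In the scraper, `launch` would be something like () => puppeteer.launch(...).
async function withBrowser(launch, work) {
  const browser = await launch()
  try {
    return await work(browser)
  } finally {
    await browser.close()
  }
}

// Fake browser that records whether close() was called.
const fakeBrowser = {
  closed: false,
  async close() {
    this.closed = true
  },
}

// Simulate a scrape that fails partway through.
const outcome = withBrowser(
  async () => fakeBrowser,
  async () => {
    throw new Error('selector not found')
  }
).catch((error) => error.message)

outcome.then((message) => console.log(message, fakeBrowser.closed))
```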

Creating Vuex action

Next, we are going to set up our Vuex store to retrieve the jobs each time we dispatch the getJobs action and store them in the state.

Open the store/job.js file and add the following code.

import JobScrapper from '~/JobScrapper'

// State
export const state = () => ({
  jobs: [],
  total_jobs: 0,
})

// Getters
export const getters = {
  getJobs: (state) => () => {
    return state.jobs
  },
}

// Mutations
export const mutations = {
  STORE_JOBS(state, payload) {
    state.jobs = payload.jobs
    state.total_jobs = payload.total_jobs
  },
}

// Actions
export const actions = {
  async getJobs({ commit }) {
    const data = await JobScrapper.getJobs()
    if (data.total_jobs) {
      commit('STORE_JOBS', data)
    }
    // Always return an array so callers can safely read .length
    return data.jobs
  },
}

Displaying Jobs

Open the pages/jobs.vue file and add the following code.

<template>
  <div class="row mt-5">
    <div class="card-group">
      <div class="row">
        <div class="col-md-8">
          <Job v-for="(job, i) in jobs" :key="i" :job="job" />
        </div>
      </div>
    </div>
  </div>
</template>

<script>
export default {
  async asyncData({ store }) {
    const getJobs = store.getters['job/getJobs']
    let jobs = getJobs()
    if (!jobs.length) {
      jobs = await store.dispatch('job/getJobs')
    }
    return { jobs }
  },
}
</script>

This is just one way to dispatch the action from any page you want, but it has to be done within the asyncData() hook, because that hook runs on the server side when the page is first requested.

Another way, and the one I prefer, is to dispatch the action inside the nuxtServerInit action, which runs on the server for every new page request.

Let me show you how to do that.

Create an index.js file inside the store folder and add the following code.
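The "scrape only when the store is empty" logic in asyncData can be tried in isolation. The sketch below uses a minimal fake store (not Vuex) that counts how often the action is dispatched. The loadJobs and createFakeStore names, and the sample job, are invented for this example.

```javascript
// Mirrors the asyncData logic: reuse jobs from the store, scrape only when empty.
async function loadJobs(store) {
  let jobs = store.getters['job/getJobs']()
  if (!jobs.length) {
    jobs = await store.dispatch('job/getJobs')
  }
  return { jobs }
}

// Minimal fake store that counts how often the action runs.
function createFakeStore() {
  const state = { jobs: [] }
  const store = {
    dispatchCount: 0,
    getters: { 'job/getJobs': () => state.jobs },
    async dispatch() {
      store.dispatchCount += 1
      state.jobs = [{ titleText: 'Vue Developer' }]
      return state.jobs
    },
  }
  return store
}

const store = createFakeStore()

// Load twice: the second call should find the jobs already in the store.
const twice = loadJobs(store).then(() => loadJobs(store))
twice.then(({ jobs }) => console.log(jobs.length, store.dispatchCount))
```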

export const actions = {
  async nuxtServerInit({ dispatch }) {
    try {
      await dispatch('job/getJobs')
    } catch (error) {
      // Ignore scraping errors so the page still renders
    }
  },
}

This will scrape the jobs and save them to the state. You can then use ...mapState or ...mapGetters to retrieve the jobs and display them in your component.

In my project, I use the nuxtServerInit approach with ...mapState in any component where I want to display the jobs.

Jobs Component

<template>
  <section>
    ........
    <div v-if="jobs.length !== 0" class="row mb-1 mt-5">
      <div v-for="job in jobs" :key="job.titleURL" class="col-md-6 col-sm-12 mb-4">
        <!-- My Job component to display a specific job -->
        <Job :job="job" />
      </div>
    </div>
    <div v-else class="row mb-1 mt-5">No Jobs at this time</div>
    ..........
  </section>
</template>

<script>
import { mapState } from 'vuex'

export default {
  computed: {
    ...mapState({
      // Show only the first 10 jobs
      jobs: (state) => {
        return [...state.job.jobs].slice(0, 10)
      },
    }),
  },
}
</script>

<style></style>

That’s all.

If you want to see my Job component, clone the repository here; everything can be found there.

P.S.

This method of web scraping with Nuxtjs using Puppeteer involves many workarounds and may be a little difficult for beginners to understand, though it works properly; I have used it in my projects.

I have a better approach to web scraping with Nuxtjs using Node/Express and Puppeteer, and I will be writing about it too.

Consider joining our newsletter to never miss a thing when it drops.

Conclusion:

Congratulations on making it this far. By now you should have a solid understanding of web scraping using Puppeteer in Nuxt.js.

You should also have built and completed the JobScrapper Project.

Keep coding 🙂

Originally published at https://masteringbackend.com on November 17, 2020.
