
Play with Puppeteer: a simple SEO Spider

Victor Magarlamov ・3 min read

I will not tell you about Puppeteer. This great library needs no introduction. So without further ado, let's play with it! Create a new Node.js project and edit the package.json file:

"main": "src/index.js",
"type": "module",
"scripts": {
  "start": "node ."
}

Because I will use ES modules, I set the type field to "module". Note that this works in Node 13 and above.

yarn add puppeteer

The goal of our application is to visit a page and check it against a few SEO rules.

The SEO rules themselves are secondary in this article. Above all, I want to show how to work with Puppeteer and parse the content of a page, and to give an example of the Command pattern.

We will start by creating a class that holds the logic for visiting a page.

import puppeteer from 'puppeteer';

export default class Spider {
  browser = null;

  async launch() {
    this.browser = await puppeteer.launch();
  }

  async visit(url) {    
    const page = await this.browser.newPage();
    await page.goto(url);

    const content = await page.content();  
  }

  async close() {
    await this.browser.close();
  }
} 

Now we can visit a site by its URL and get its content as a string. We could parse this string with a regular expression to check, for example, the length of the description meta tag. But I'm not very good at regexp 🤯
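To make the regexp idea concrete, here is a rough sketch of what such a check might look like. The helper name getDescriptionLength and the sample HTML are made up for this example, and the pattern assumes double-quoted attributes with name written before content. The second call shows how easily this kind of check breaks.

```javascript
// Hypothetical helper: read the meta description straight from the HTML string.
// Assumption: attributes are double-quoted and `name` comes before `content`.
function getDescriptionLength(html) {
  const match = html.match(/<meta\s+name="description"\s+content="([^"]*)"/i);
  return match ? match[1].length : 0;
}

const html = '<head><meta name="description" content="A simple SEO spider"></head>';
console.log(getDescriptionLength(html)); // 19

// Same tag with the attributes swapped: the pattern no longer matches.
const swapped = '<head><meta content="A simple SEO spider" name="description"></head>';
console.log(getDescriptionLength(swapped)); // 0
```

A real page can write the same tag in many ways, so a DOM parser is the safer choice.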

There is a great library, jsdom, that turns an HTML string into a DOM object. Let's add it to our project

yarn add jsdom

and edit the Spider class:

import puppeteer from 'puppeteer';
import jsdom from 'jsdom';

const { JSDOM } = jsdom;
...
  const content = await page.content(); 
  return new JSDOM(content);
}
...

Now we can work with the content of the page using querySelector and other similar methods. Let's do that and write a new class for validating the page content. Or, more precisely, we will create several classes: one class per validation rule.

export default class CheckTitleCommand {
  value = null;
  errors = [];

  constructor(document) {
    this.value = document.title;
  }

  execute() {
    if (!this.value || this.value.length === 0) {
      this.errors.push('The page title is empty');
    } else if (this.value.length > 50) {
      this.errors.push('The page title is too long');
    }   
  }

  getResult() {
    if (this.errors.length > 0) {
      return {isValid: false, message: this.errors.toString()};
    }

    return {isValid: true, message: 'Title is OK'};
  }
}

We encapsulate the logic of a validation rule in an object: the Command pattern in action. Here is another command.

export default class CheckDescriptionCommand {
  value = null;
  errors = [];

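  constructor(document) {
    // querySelector returns the <meta> element (or null); the text lives in its content attribute
    const meta = document.head.querySelector('meta[name=description]');
    this.value = meta ? meta.getAttribute('content') : null;
  }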

  execute() {
    if (!this.value || this.value.length === 0) {
      this.errors.push('The page description is empty');
    }
  }

  getResult() {
    if (this.errors.length > 0) {
      return {isValid: false, message: this.errors.toString()};
    }

    return {isValid: true, message: 'Meta description is OK'};
  }
}
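Since every rule lives in its own class, adding a rule means writing one more class of the same shape. As a sketch, here is a hypothetical third command. The rule itself (exactly one h1 per page) and the name CheckH1Command are my assumptions for this example, not part of the project above. In the project it would go into its own file with export default, like the other commands.

```javascript
// Hypothetical rule: the page should contain exactly one <h1> heading.
class CheckH1Command {
  count = 0;
  errors = [];

  constructor(document) {
    // querySelectorAll returns a NodeList; only its length matters here
    this.count = document.querySelectorAll('h1').length;
  }

  execute() {
    if (this.count === 0) {
      this.errors.push('The page has no h1 heading');
    } else if (this.count > 1) {
      this.errors.push('The page has more than one h1 heading');
    }
  }

  getResult() {
    if (this.errors.length > 0) {
      return {isValid: false, message: this.errors.toString()};
    }

    return {isValid: true, message: 'H1 is OK'};
  }
}
```

The constructor only needs an object with a querySelectorAll method, so the command can be tried out without a browser.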

All commands have a common interface. Let's see how to work with it.

import CheckTitleCommand from './commands/CheckTitleCommand.js';
import CheckDescriptionCommand from './commands/CheckDescriptionCommand.js';

export default class Validator {
  document = null;

  constructor(dom) {
    this.document = dom.window.document;
  }

  validate() {
    [
      new CheckTitleCommand(this.document),
      new CheckDescriptionCommand(this.document),
    ].forEach(command => {
       command.execute();
       console.log(command.getResult().message);
    });
  }
}
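The validate method above only prints messages. If the caller needs to react to failures, one option is to return the results instead. Below is a minimal sketch of that idea. The function runCommands and the two stand-in commands are hypothetical and exist only to make the example runnable on its own.

```javascript
// Sketch: execute every command and hand the results back to the caller.
function runCommands(commands) {
  return commands.map(command => {
    command.execute();
    return command.getResult();
  });
}

// Stand-in commands with the same interface as the real ones.
const passing = {
  execute() {},
  getResult() { return {isValid: true, message: 'Title is OK'}; },
};
const failing = {
  execute() {},
  getResult() { return {isValid: false, message: 'The page description is empty'}; },
};

const results = runCommands([passing, failing]);
const failed = results.filter(result => !result.isValid);
console.log(failed.map(result => result.message)); // [ 'The page description is empty' ]
```

This works only because all commands share execute and getResult; the runner knows nothing about the individual rules.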

Let's put it all together and see what happens.

import Spider from './Spider.js';
import Validator from './Validator.js';

(async () => {
  const spider = new Spider();
  await spider.launch();

  const dom = await spider.visit('http://wwwwwwww.jodi.org');
  const validator = new Validator(dom);
  validator.validate();

  await spider.close();
})();
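If visit throws (for example, when the URL cannot be reached), the script above never gets to close the browser. A possible way around this is a small wrapper with try/finally. The helper withSpider is a hypothetical name; it only assumes an object with the launch and close methods of the Spider class.

```javascript
// Hypothetical helper: launch the spider, run the callback,
// and close the spider even if the callback throws.
async function withSpider(spider, callback) {
  await spider.launch();
  try {
    return await callback(spider);
  } finally {
    await spider.close();
  }
}
```

With it, the body of the main function could become a single call such as withSpider(new Spider(), async spider => new Validator(await spider.visit(url)).validate()), where url is the page to check.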
