Scraping websites with Xray

Dennis O'Keeffe
Software Engineer by trade. Formerly of Culture Amp, UsabilityHub, Present Company and NightGuru.
Originally published at blog.dennisokeeffe.com

In this short post, we're going to scrape the website that this blog is hosted on to get all the links and posts back using Node.js and Xray.

Setup

We are going to keep things super minimal and bare. We just want a proof of concept on how to scrape the data from the rendered website HTML.

mkdir hello-xray
cd hello-xray
yarn init -y
yarn add x-ray
touch index.js

Scraping the website

Going to the blog and inspecting it with the Developer Tools, we can see that there aren't many classes to work with, but we can use element selectors to decide how we are going to get the information back.

The website with developer tools

Open the index.js file we created during setup and add the following:

const Xray = require("x-ray")

function getPosts(url = "https://blog.dennisokeeffe.com/") {
  const x = Xray()
  return new Promise((resolve, reject) => {
    x(`${url}`, "main:last-child", {
      items: x("div", [
        {
          title: "h3 > a",
          description: "p",
          link: "h3 > a@href",
          date: "small",
        },
      ]),
    })((err, data) => {
      if (err) {
        // Return early so resolve is not also called after a failure
        return reject(err)
      }

      resolve(data)
    })
  })
}

const main = async () => {
  const posts = await getPosts()
  console.log(posts)
}

main()

In the above script, we are simply running a main function that calls getPosts and waits for the Promise to resolve before logging out the results.
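The manual Promise wrapper inside getPosts could also be written with Node's built-in util.promisify, since the function returned by x(...) takes a Node-style (err, data) callback. Here is a minimal sketch of that idea. fakeScrape and getPostsSketch are hypothetical names, and fakeScrape stands in for the function Xray returns so that the example runs without network access:

```javascript
const { promisify } = require("util")

// Hypothetical stand-in for the function returned by x(url, scope, selector).
// It accepts a single Node-style callback: callback(err, data).
function fakeScrape(callback) {
  setImmediate(() => callback(null, { items: [{ title: "Example post" }] }))
}

// promisify turns the callback-style function into one that returns a Promise.
const scrape = promisify(fakeScrape)

async function getPostsSketch() {
  // Rejects if the callback receives an error, resolves with the data otherwise.
  return scrape()
}

getPostsSketch().then((data) => console.log(data.items.length)) // logs 1
```

With the real library, passing the result of x(url, "main:last-child", {...}) to promisify should behave the same way, assuming the returned function is called with only a callback.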

The important part of the code comes from within the getPosts function:

x(`${url}`, "main:last-child", {
  items: x("div", [
    {
      title: "h3 > a",
      description: "p",
      link: "h3 > a@href",
      date: "small",
    },
  ]),
})((err, data) => {
  if (err) {
    // Return early so resolve is not also called after a failure
    return reject(err)
  }

  resolve(data)
})

The x function requests the blog URL, then scopes the search with the main:last-child selector, which matches the main element that is the last child of its parent. You can see that element in the HTML DOM in the image shared above.

We are telling Xray to return an array of items, built from every div inside that scope, with each entry shaped like the object we pass. In our case, I am using standard selectors to grab the text of the title, description and date, but am using the extra @href suffix on the link selector to fetch the href attribute, which is the URL of the blog post!
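Given the selector object above, the resolved data has the shape { items: [{ title, description, link, date }, ...] }. Because the div selector matches every div under the scope, some entries may come back with missing fields or stray whitespace. That is an assumption about the page markup, not something observed in this post. A small hypothetical helper, cleanPosts, could tidy the result. The sample input is made up for illustration:

```javascript
// Hypothetical helper: trims string fields and drops entries without a title or link.
function cleanPosts(data) {
  const items = (data && data.items) || []
  return items
    .filter((item) => item && item.title && item.link)
    .map((item) => ({
      title: item.title.trim(),
      description: (item.description || "").trim(),
      link: item.link,
      date: (item.date || "").trim(),
    }))
}

// Made-up sample input in the same shape as the scraper output.
const sample = {
  items: [
    { title: "  First post ", description: "Intro\n", link: "https://example.com/first", date: " 1 Jan " },
    { description: "A div with no heading" },
  ],
}

console.log(cleanPosts(sample).length) // logs 1
```

In index.js this could be applied as cleanPosts(await getPosts()) before logging.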

That's it! Let's run the scraper now using node index.js.
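As written, main has no error handling, so a failed request would surface as an unhandled promise rejection when the script runs. Here is a minimal sketch of catching it. runScraper is a hypothetical name, and the succeeds and fails stubs stand in for the real getPosts so the example runs offline:

```javascript
// Runs a getPosts-style function and reports a failure instead of crashing.
async function runScraper(getPosts) {
  try {
    const posts = await getPosts()
    console.log(posts)
    return true
  } catch (err) {
    console.error("Scrape failed:", err.message)
    return false
  }
}

// Stubs standing in for the real getPosts function.
const succeeds = () => Promise.resolve({ items: [] })
const fails = () => Promise.reject(new Error("network unreachable"))

runScraper(succeeds)
runScraper(fails)
```

In the real script, main could call runScraper(getPosts) in place of awaiting getPosts directly.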

Result

Perfect! Now you can take these same short tips and apply them to anything you need to scrape down the track. Looking for alternatives or browser automation? You should also check out Puppeteer or Playwright (added to the resource links).

Resources and Further Reading

  1. GitHub - Xray
  2. GitHub - Puppeteer
  3. GitHub - Playwright
  4. Completed project

Originally posted on my blog. Follow me on Twitter for more hidden gems @dennisokeeffe92.

Discussion (1)

Functional Javascript

Nice one Dennis.
I tested it out and it works.

I converted it from an explicit Promise idiom to an async-await idiom:

const Xray = require('x-ray');

//util
// getPosts resolves to undefined when its catch block runs, so guard the access
const lpromise = p => p.then(o => console.log(o?.items));

/*
@func
retrieve posts using xray

@typedef {{title: string, description: string, link: string, date: string}} post
@typedef {{items: post[]}} itemsObj
@return {Promise<itemsObj>}
*/
const getPosts = async () => {
  const x = Xray();
  const url = "https://blog.dennisokeeffe.com";
  try {
    return await x(url, "main:last-child", {
      items: x("div", [
        {
          title: "h3 > a",
          description: "p",
          link: "h3 > a@href",
          date: "small",
        },
      ]),
    });
  } catch (err) {
    console.error(err);
  }
};

//@tests
lpromise(getPosts());