Originally posted at https://blog.parkeragee.com/post/scraping-data-to-generate-markdown-files-and-populate-a-statically-generated-site-with-content/
In this post, I'm going to show you how I efficiently added 300+ web pages of content to one of my clients' websites by creating a script that scrapes the web and generates markdown files from that data.
This client is a wig distributor, and they needed pictures and names of all of their available wigs on their website. So, instead of manually creating each page and copying and pasting images and names, I created a script to grab all of that information from the manufacturer's website.
Let's get started.
First things first
We need to create a directory that our script will be added to. Go ahead and run mkdir content-scraper && cd $_. That will create our directory and move us into it.
Next, we want to run npm init -y to set up our project's package.json file.
Once we've created our package.json file, we need to install some node packages to help us achieve our goal. Run npm install --save path json2md axios cheerio fs-extra chalk to install the required packages.
Now, let's create the file we'll be working in - touch index.js. Let's build our script in index.js.
Add node packages
First, let's bring in all of our node packages.
const path = require('path');
const json2md = require('json2md');
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs-extra');
const chalk = require('chalk');
Now, we want to create a function that will run when we initiate the script. Let's drop down and add the following:
async function go() {
  console.log('Running...');
}

go();
Now, if we run node index.js, we should get Running... in the terminal.
Get the HTML of the web page
Next, we're going to fetch the HTML of the web page we want to scrape (we'll use cheerio.js to parse it in a moment). Let's create a new function for that code.
Add the new function to your file.
async function getHtml(url) {
  const { data: html } = await axios.get(url);
  return html;
}
This is going to use axios to make a request and fetch the HTML contents of the URL we pass it. We are going to return that HTML to our go() function below.
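One thing to keep in mind: as written, getHtml() will throw if a request fails. If you'd rather have the script print a friendlier message and keep going, you could wrap the request in a try/catch. Here's a minimal sketch of that variation (optional, not part of the original script):

// Optional variation of getHtml() with basic error handling
async function getHtml(url) {
  try {
    const { data: html } = await axios.get(url);
    return html;
  } catch (error) {
    console.log(chalk.red(`Could not fetch ${url}: ${error.message}`));
    return null;
  }
}

If you use this version, remember to check for null before handing the HTML to cheerio.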
In the go() function we've already added, let's call the getHtml() function and pass it our URL. Add the following to your go() function:
async function go() {
  const url = process.argv[2];

  if (url === undefined) {
    console.log(chalk.white.bgRed.bold('Please provide a URL to scrape.'));
    console.log('Try something like:');
    console.log(chalk.green('node index.js https://www.hairuwear.com/raquel-welch/products-rw/signature-wig-collection-2/'));
    return;
  }

  const html = await getHtml(url);
  console.log(html);
}
We're checking to see if we passed a URL via the terminal. If not, we display an error message in the terminal explaining how to run the script. If we pass a valid URL, then you should see the HTML for that page displayed in your terminal after running the script.
Scrape data from HTML
Now that we have the HTML from the web page, we need to gather the data that we need for our markdown files. Let's create a new function to take that HTML and find our data.
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    console.log(wigLink, wigName);
  }
}
Now, let's call that function with our HTML in the go() function. Your go() function should now look like this:
async function go() {
  const url = process.argv[2];

  if (url === undefined) {
    console.log(chalk.white.bgRed.bold('Please provide a URL to scrape.'));
    console.log('Try something like:');
    console.log(chalk.green('node index.js https://www.hairuwear.com/raquel-welch/products-rw/signature-wig-collection-2/'));
    return;
  }

  const html = await getHtml(url);
  await getWigs(html);
}
You should now see a link and name for each wig on the page.
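For example, using one of the wigs from this collection, each logged line should look something like this:

https://www.hairuwear.com/product/if-you-dare/ If You Dare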
Get the high-res image from the wig page
If you notice, the images on this page are pretty low-res. But if you click on each wig, it will take you to a detail page about that specific wig with higher-res photos on it. So, for each wig on this page, we'll also need to grab the HTML of its detail page and pull the high-res photo from that page to add to our data.
We'll do that by going into our for loop where we get the wig link and name and adding the code there. It should look like this:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    console.log(wigLink, wigName, imgSrc);
  }
}
You'll notice we added 3 lines of code here for getting the high-res image.
const wigDetailsHtml = await getHtml(wigLink);
const wigHtml = cheerio.load(wigDetailsHtml);
const imgSrc = wigHtml('div.images > a > img').attr('src');
We're going to reuse our getHtml() function and pass the wig detail page link to it. Then, we'll find the high-res image's DOM element and grab the src attribute's value. Now we have our high-res image source data. If you run node index.js, you'll notice that the script is running a little slower now that we are making additional requests, but we are getting all of the data we need.
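Since we're now making an extra request for every wig, you might also want to slow the loop down a little so we aren't hammering the manufacturer's server. This isn't in the original script, but a tiny delay helper is an easy, polite addition — a minimal sketch:

// Optional: a small helper to pause between requests
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Then, inside the for loop after fetching the detail page:
// await delay(500); // wait half a second before the next wig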
JSON to Markdown
Now, we're going to make all of this come together with json2md. Let's create a new function that will take our scraped data and create some markdown for each wig.
async function generateMarkdown(data) {
  const heading = `---\ntitle: ${data.name}\nthumbnail: '${data.imgSrc}'\n---\n\n`;

  const md = await json2md([
    {
      h1: data.name
    },
    {
      link: {
        title: data.name,
        source: data.link,
      }
    },
    {
      img: {
        title: data.name,
        source: data.imgSrc,
      }
    }
  ]);

  return `${heading}${md}`;
}
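If you haven't used json2md before, it takes an array of objects and turns each one into a chunk of markdown. A tiny standalone example of the mapping we're relying on above (the wig name and URLs here are made up):

const example = json2md([
  { h1: 'Example Wig' },
  { link: { title: 'Example Wig', source: 'https://example.com/example-wig/' } },
  { img: { title: 'Example Wig', source: 'https://example.com/example-wig.jpg' } }
]);
// # Example Wig
// [Example Wig](https://example.com/example-wig/)
// ![Example Wig](https://example.com/example-wig.jpg)

The YAML frontmatter (the heading variable) isn't something json2md generates for us, which is why it's built with a plain template string and prepended to the result.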
And we'll need to run that function for each one of our wigs that we need a page for. So, we'll add it to our for loop in the getWigs() function. Your getWigs() function should look like this now:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    // Get high-res photo from detail page
    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    // create markdown here
    const markdown = await generateMarkdown({
      name: wigName,
      link: wigLink,
      imgSrc,
    });

    console.log(markdown);
  }
}
Now, when you run node index.js, you should get some markdown that looks like this:
---
title: If You Dare
thumbnail: 'https://www.hairuwear.com/wp-content/uploads/RW-ifyoudare.jpg'
---
# If You Dare
[If You Dare](https://www.hairuwear.com/product/if-you-dare/)
![If You Dare](https://www.hairuwear.com/wp-content/uploads/RW-ifyoudare.jpg)
Next, we just need to create our file with the markdown as the content. Add these two lines of code right after the previous addition:
const file = path.join('.', 'wig-pages', `${wigName.split(' ').join('-')}.md`);
await fs.writeFile(file, markdown);
So our getWigs() function should look like this now:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    // Get high-res photo from detail page
    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    // create markdown here
    const markdown = await generateMarkdown({
      name: wigName,
      link: wigLink,
      imgSrc,
    });

    // Create new markdown file and add markdown content
    const file = path.join('.', 'wig-pages', `${wigName.split(' ').join('-')}.md`);
    await fs.writeFile(file, markdown);
  }
}
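One small thing to keep an eye on: the filename comes from wigName.split(' ').join('-'), which only swaps spaces for hyphens. That's fine for this catalog, but if a product name ever contains characters you don't want in a filename (slashes, quotes, and so on), you could swap in a slightly more defensive slug helper. This is just a hypothetical sketch, not part of the original script:

// Optional: a more defensive way to turn a wig name into a filename
function slugify(name) {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // anything that isn't a letter or number becomes a hyphen
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}

// const file = path.join('.', 'wig-pages', `${slugify(wigName)}.md`);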
Now, we should have a directory called wig-pages full of markdown files that contain our scraped content. You can just copy this folder into the content directory (depending on your static site generator) of your website and deploy your changes 🎉.
This is just one example of how to scrape data and populate a statically generated site with content. Feel free to take this method and apply it to your own needs.