Originally posted at https://blog.parkeragee.com/post/scraping-data-to-generate-markdown-files-and-populate-a-statically-generated-site-with-content/
In this post, I'm going to show you how I efficiently added 300+ web pages of content to one of my clients' websites by creating a script that scrapes the web and generates markdown files from that data.
This client is a wig distributor, and they needed pictures and names of all of their available wigs on their website. So, instead of manually creating each page and copying and pasting images and names, I created a script to grab all of that information from the manufacturer's website.
Let's get started.
First things first
We need to create a directory that our script will be added to. Go ahead and run mkdir content-scraper && cd $_. That will create our directory and move us into it.
Next, we want to run npm init -y to set up our project's package.json file.
Once we've created our package.json file, we need to install some node packages to help us achieve our goal. Run npm install --save path json2md axios cheerio fs-extra chalk to install the required packages.
Now, let's create the file we'll be working in - touch index.js. Let's build our script in index.js.
Add node packages
First, let's bring in all of our node packages.
const path = require('path');
const json2md = require('json2md');
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs-extra');
const chalk = require('chalk');
Now, we want to create a function that will run when we initiate the script. Let's drop down and add the following:
async function go() {
  console.log('Running...');
}

go();
Now, if we run node index.js, we should get Running... in the terminal.
Get the HTML of the web page
Next, we're going to fetch the HTML of the web page we want to scrape (we'll use cheerio.js to parse it in a moment). Let's create a new function for that code.
Add the new function to your file.
async function getHtml(url) {
  const { data: html } = await axios.get(url);
  return html;
}
This is going to use axios to make a request and fetch the HTML contents of the URL we pass it. We are going to return that HTML to our go() function below.
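One thing to keep in mind: as written, getHtml() will throw if a request fails. If you'd rather have the script print a friendlier message and keep going, you could wrap the request in a try/catch. Here's a minimal sketch of that variation (optional, not part of the original script):

// Optional variation of getHtml() with basic error handling
async function getHtml(url) {
  try {
    const { data: html } = await axios.get(url);
    return html;
  } catch (error) {
    console.log(chalk.red(`Could not fetch ${url}: ${error.message}`));
    return null;
  }
}

If you use this version, remember to check for null before handing the HTML to cheerio.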
In the go() function we've already added, let's call the getHtml() function and pass it our URL. Add the following to your go() function:
async function go() {
  const url = process.argv[2];

  if (url === undefined) {
    console.log(chalk.white.bgRed.bold('Please provide a URL to scrape.'));
    console.log('Try something like:');
    console.log(chalk.green('node index.js https://www.hairuwear.com/raquel-welch/products-rw/signature-wig-collection-2/'));
    return;
  }

  const html = await getHtml(url);
  console.log(html);
}
We're checking to see if we passed a URL via the terminal. If not, we display an error message in the terminal explaining how to run the script. If we pass a valid URL, then you should see the HTML for that page displayed in your terminal after running the script.
Scrape data from HTML
Now that we have the HTML from the web page, we need to gather the data that we need for our markdown files. Let's create a new function to take that HTML and find our data.
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    console.log(wigLink, wigName);
  }
}
Now, let's call that function with our HTML in the go() function. Your go() function should now look like this:
async function go() {
  const url = process.argv[2];

  if (url === undefined) {
    console.log(chalk.white.bgRed.bold('Please provide a URL to scrape.'));
    console.log('Try something like:');
    console.log(chalk.green('node index.js https://www.hairuwear.com/raquel-welch/products-rw/signature-wig-collection-2/'));
    return;
  }

  const html = await getHtml(url);
  await getWigs(html);
}
You should now see a link and name for each wig on the page.
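For example, using one of the wigs from this collection, each logged line should look something like this:

https://www.hairuwear.com/product/if-you-dare/ If You Dare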
Get the high-res image from the wig page
If you notice, the images on this page are pretty low-res. But if you click on each wig, it will take you to a detail page about that specific wig with higher-res photos on it. So, for each wig on this page, we'll also need to grab the HTML of its detail page and pull the high-res photo from that page to add to our data.
We'll do that by going into our for loop where we get the wig link and name and adding the code there. It should look like this:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    console.log(wigLink, wigName, imgSrc);
  }
}
You'll notice we added 3 lines of code here for getting the high-res image.
const wigDetailsHtml = await getHtml(wigLink);
const wigHtml = cheerio.load(wigDetailsHtml);
const imgSrc = wigHtml('div.images > a > img').attr('src');
We're going to reuse our getHtml() function and pass the wig detail page link to it. Then, we'll find the high-res image's DOM element and grab the src attribute's value. Now we have our high-res image source data. If you run node index.js, you'll notice that the script is running a little slower now that we are making additional requests, but we are getting all of the data we need.
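Since we're now making an extra request for every wig, you might also want to slow the loop down a little so we aren't hammering the manufacturer's server. This isn't in the original script, but a tiny delay helper is an easy, polite addition — a minimal sketch:

// Optional: a small helper to pause between requests
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Then, inside the for loop after fetching the detail page:
// await delay(500); // wait half a second before the next wig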
JSON to Markdown
Now, we're going to make all of this come together with json2md. Let's create a new function that will take our scraped data and create some markdown for each wig.
async function generateMarkdown(data) {
  const heading = `---\ntitle: ${data.name}\nthumbnail: '${data.imgSrc}'\n---\n\n`;

  const md = await json2md([
    {
      h1: data.name
    },
    {
      link: {
        title: data.name,
        source: data.link,
      }
    },
    {
      img: {
        title: data.name,
        source: data.imgSrc,
      }
    }
  ]);

  return `${heading}${md}`;
}
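If you haven't used json2md before, it takes an array of objects and turns each one into a chunk of markdown. A tiny standalone example of the mapping we're relying on above (the wig name and URLs here are made up):

const example = json2md([
  { h1: 'Example Wig' },
  { link: { title: 'Example Wig', source: 'https://example.com/example-wig/' } },
  { img: { title: 'Example Wig', source: 'https://example.com/example-wig.jpg' } }
]);
// # Example Wig
// [Example Wig](https://example.com/example-wig/)
// ![Example Wig](https://example.com/example-wig.jpg)

The YAML frontmatter (the heading variable) isn't something json2md generates for us, which is why it's built with a plain template string and prepended to the result.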
And we'll need to run that function for each one of our wigs that we need a page for. So, we'll add it to our for loop in the getWigs() function. Your getWigs() function should look like this now:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    // Get high-res photo from detail page
    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    // create markdown here
    const markdown = await generateMarkdown({
      name: wigName,
      link: wigLink,
      imgSrc,
    });

    console.log(markdown);
  }
}
Now, when you run node index.js, you should get some markdown that looks like this:
---
title: If You Dare
thumbnail: 'https://www.hairuwear.com/wp-content/uploads/RW-ifyoudare.jpg'
---
# If You Dare
[If You Dare](https://www.hairuwear.com/product/if-you-dare/)
![If You Dare](https://www.hairuwear.com/wp-content/uploads/RW-ifyoudare.jpg)
Next, we just need to create our file with the markdown as the content. Add these two lines of code right after the previous addition:
const file = path.join('.', 'wig-pages', `${wigName.split(' ').join('-')}.md`);
await fs.writeFile(file, markdown);
So our getWigs() function should look like this now:
async function getWigs(html) {
  // Load the HTML as a cheerio instance
  const $ = cheerio.load(html);

  // Find the products list elements
  const wigSpan = $('.products li');

  // We want to make a new directory for our markdown files to live
  const directory = path.join('.', 'wig-pages');
  await fs.mkdirs(directory);

  // Loop through wigs and get data
  for (let i = 0; i < wigSpan.length; i++) {
    // Giving ourselves a little feedback about the process
    console.log(`Getting ${i} of ${wigSpan.length - 1}`);

    // Get the DOM elements we need
    const wigLinkSpan = $(wigSpan[i]).find('a')[0];
    const wigNameSpan = $(wigLinkSpan).find('h3')[0];

    // Get wig link and name data
    const wigLink = $(wigLinkSpan).attr('href');
    const wigName = $(wigNameSpan).text();

    // Get high-res photo from detail page
    const wigDetailsHtml = await getHtml(wigLink);
    const wigHtml = cheerio.load(wigDetailsHtml);
    const imgSrc = wigHtml('div.images > a > img').attr('src');

    // create markdown here
    const markdown = await generateMarkdown({
      name: wigName,
      link: wigLink,
      imgSrc,
    });

    // Create new markdown file and add markdown content
    const file = path.join('.', 'wig-pages', `${wigName.split(' ').join('-')}.md`);
    await fs.writeFile(file, markdown);
  }
}
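One small thing to keep an eye on: the filename comes from wigName.split(' ').join('-'), which only swaps spaces for hyphens. That's fine for this catalog, but if a product name ever contains characters you don't want in a filename (slashes, quotes, and so on), you could swap in a slightly more defensive slug helper. This is just a hypothetical sketch, not part of the original script:

// Optional: a more defensive way to turn a wig name into a filename
function slugify(name) {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // anything that isn't a letter or number becomes a hyphen
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}

// const file = path.join('.', 'wig-pages', `${slugify(wigName)}.md`);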
Now, we should have a directory called wig-pages full of markdown files that contain our scraped content. You can just copy this folder into the content directory (depending on your static site generator) of your website and deploy your changes 🎉.
This is just one example of how to scrape data and populate a statically generated site with content. Feel free to take this method and apply it to your own needs.