When I was starting out, I never understood how to tackle 'projects'. It was hard to understand the scope of a project at a glance, as well as the steps and considerations involved. I wanted to see behind the scenes. We all work differently, but in this article I'm showing how I work when I get an idea-spark for something tiny. Come watch me scratch an itch.
I was on the MDN Web Docs earlier looking for something to spark a research journey for this week's article when I found that there was no random page link. A random page function would be useful for general revision of web technologies, letting you click through to whatever sounds interesting.
I began my journey by hacking away at the index page for JavaScript. This is often how I tackle front end web problems at my job too, using DevTools as the draft version of a site.
In the HTML there's a table body located at document.querySelector("#wikiArticle > table > tbody"). On Chrome, you can right-click an element and go Copy -> Copy JS Path to find this. Each entry uses three table rows. I'm after the page link and also the summary. Let's separate the rows into groups that make up one entry each.
You can run these lines in your console.
// Index table
const indexTable = document.querySelector("#wikiArticle > table > tbody");
// There are three <tr>s to a row
const allRows = indexTable.querySelectorAll('tr');
const chunkRows = [];
for (var i = 0; i < allRows.length; i += 3) {
chunkRows.push([allRows[i], allRows[i + 1], allRows[i + 2]]);
}
I need to store a copy of the index page's data somewhere. Scraping the page on every request would be wasteful and slow. Storing the entries as JSON makes the most sense as I'll be delivering this over the web. A page snippet should have a link property and a text property. I'll take care of some of the templating here too.
// Map these chunked rows into a row object
const pageEntries = [];
chunkRows.forEach(row => {
const a = row[0].querySelector('a');
const page = {
link: `<a class="page-link" href="${a.href}">${a.innerText}</a>`,
text: `<div class="page-text">${row[1].querySelector('td').innerText}</div>`
}
pageEntries.push(page);
})
I needed to decide how I was going to deliver this to users. I considered a Chrome extension but I wanted to prototype faster so I turned to Glitch and used their Express template.
I packaged the scraping logic into an async function and used puppeteer to run these commands in the context of the index page. (On Glitch, you have to use --no-sandbox, which can be a security risk.) In the finished project, this script can be called manually with node getNewPages.js to update the entries on disk. (This would fit neatly into a cron job; there's a sketch of that right after the script below.)
From Chrome DevTools hackery to something a little more cohesive.
// getNewPages.js
const fs = require('fs');
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
args: ['--no-sandbox'] // Required for Glitch
});
const page = await browser.newPage();
await page.goto('https://developer.mozilla.org/en-US/docs/Web/JavaScript/Index');
// Scan index table, gather links and summaries
const pageEntries = await page.evaluate(() => {
// Index table
const indexTable = document.querySelector("#wikiArticle > table > tbody");
// There are three <tr>s to a row
const allRows = indexTable.querySelectorAll('tr');
const chunkRows = [];
for (var i = 0; i < allRows.length; i += 3) {
chunkRows.push([allRows[i], allRows[i + 1], allRows[i + 2]]);
}
// Map these chunked rows into a row object
const pageEntries = [];
chunkRows.forEach(row => {
const a = row[0].querySelector('a');
const page = {
link: `<a class="page-link" href="${a.href}">${a.innerText}</a>`,
text: `<div class="page-text">${row[1].querySelector('td').innerText}</div>`
}
pageEntries.push(page);
})
return pageEntries;
});
// Save page objects to JSON on disk
const pageJSON = JSON.stringify(pageEntries);
fs.writeFile('./data/pages.json', pageJSON, function (err) {
if (err) {
return console.log(err);
}
console.log(`New pages saved! (JSON length: ${pageJSON.length})`);
});
await browser.close();
})();
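Since getNewPages.js just rewrites ./data/pages.json, re-running it on a schedule is straightforward. Here's a minimal sketch, assuming the node-cron package (this isn't part of the original project):
// scheduleScrape.js (a sketch, not part of the original project)
const cron = require('node-cron');
const { exec } = require('child_process');
// Every Sunday at 03:00, re-run the scraper to refresh ./data/pages.json
cron.schedule('0 3 * * 0', () => {
  exec('node getNewPages.js', (err, stdout, stderr) => {
    if (err) return console.error(err);
    console.log(stdout);
  });
});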
You can load JSON files into Node with require('./data/pages.json'), which keeps all of our page entries in memory for faster access (fine here given the fixed, small size of ~300 KB). The rest of our web app is a wrapper around a random function.
// server.js
// init project
const express = require('express');
const app = express();
app.use(express.static('public'));
const pages = require('./data/pages.json');
const rndPage = () => pages[Math.floor(Math.random() * pages.length)];
app.get('/', (req, res) => res.sendFile(__dirname + '/views/index.html'));
app.get('/rnd', (req, res) => res.send(rndPage()));
// listen for requests :)
const listener = app.listen(process.env.PORT, function() {
console.log('Your app is listening on port ' + listener.address().port);
});
The only client-side JavaScript we need is a fetch call to update the page link and summary, plus a button with a cute wobble animation to flip through the entries. There's a 1 in 897 chance of getting the same snippet twice in a row. How would you solve this? Let me know in the comments!
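For reference, that client-side call might look something like this sketch (the element ids are assumptions, not taken from the project):
// public/client.js (a sketch; the real ids and markup may differ)
const button = document.querySelector('#new-page');
const output = document.querySelector('#output');
button.addEventListener('click', async () => {
  const res = await fetch('/rnd');
  const page = await res.json();
  // The server already returns templated HTML for the link and summary
  output.innerHTML = `${page.link}${page.text}`;
});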
Where to go from here: other pages on MDN Web Docs (beyond the JavaScript index) follow the same row pattern, meaning they can be scraped simply by changing the URL in our script.
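One way to do that without editing the file each time (just a sketch, not something in the repo) is to read the index URL from the command line:
// In getNewPages.js, the hard-coded URL could become:
const indexUrl = process.argv[2] ||
  'https://developer.mozilla.org/en-US/docs/Web/JavaScript/Index';
// ...and later: await page.goto(indexUrl);
// Usage: node getNewPages.js <another-index-url>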
You can remix the project to play with a copy of the code live or clone the repo: healeycodes/random-mdn-page.
Join 150+ people signed up to my newsletter on programming and personal growth!
I tweet about tech @healeycodes.
Top comments (6)
Great post describing the full process from idea to production. Also, I hadn't heard about Glitch before your post. It's so easy to develop and deploy Node apps on Glitch!
I'm just curious why you chose to use puppeteer for scraping. I thought it was more suitable for end-to-end tests. I think it adds extra complexity by introducing a browser's launch/exit, sandboxes, and page evaluation. Maybe it would be better to use a fetch library (like request) to fetch the MDN page and parse it with cheerio in jQuery style?
Thanks Eugene! I had only scraped with Python before but wanted this project to be all JavaScript. So I googled and went with one of the first results. Thanks for the recommendations!
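For reference, a rough sketch of the approach Eugene describes might look like the below (not what this project uses; node-fetch stands in as the HTTP client and the selectors are reused from the puppeteer version):
// cheerioScrape.js (a sketch of the suggested alternative)
const fetch = require('node-fetch');
const cheerio = require('cheerio');
(async () => {
  const res = await fetch('https://developer.mozilla.org/en-US/docs/Web/JavaScript/Index');
  const $ = cheerio.load(await res.text());
  const rows = $('#wikiArticle > table > tbody > tr').toArray();
  const pageEntries = [];
  for (let i = 0; i < rows.length; i += 3) {
    const a = $(rows[i]).find('a').first();
    // cheerio sees the raw relative href, so resolve it against the site root
    const href = new URL(a.attr('href'), 'https://developer.mozilla.org').href;
    pageEntries.push({
      link: `<a class="page-link" href="${href}">${a.text()}</a>`,
      text: `<div class="page-text">${$(rows[i + 1]).find('td').text().trim()}</div>`,
    });
  }
  console.log(`Scraped ${pageEntries.length} entries`);
})();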
The first ideas that come to my mind to prevent the script from randomly picking the same entry twice: for the current session at least, I would remove the chosen item from the choice pool or keep a memory of chosen items.
Using the current session is an interesting way of solving it! Nice idea. ✨
As for the unlikely but possible chance of getting the same link twice in a row, you could create a small array of the last few links generated, and tell the function to choose a new random link if it matches any link in the array. This array would also allow for some additional functionality (like an 'oops' button allowing you to go back to the last link when you've accidentally clicked to generate a new link when you saw one you were interested in.) Maybe that's a bit more than necessary, or maybe you could expand on that a little further even. JM2C
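For instance, a rough client-side sketch of that idea (names are illustrative):
// Keep a short memory of recent links and re-roll if we hit a repeat
const recent = [];
const MAX_RECENT = 5;
async function getFreshPage() {
  let page = await (await fetch('/rnd')).json();
  // Try a few more times if we've seen this link recently
  for (let tries = 0; recent.includes(page.link) && tries < 5; tries++) {
    page = await (await fetch('/rnd')).json();
  }
  recent.push(page.link);
  if (recent.length > MAX_RECENT) recent.shift(); // forget the oldest
  return page;
}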
I like this idea. I think I'd prefer to do it on the client even if it's +1 request (rarely) as you described.
I like the state idea too, the 'oops' button is cool.