Introduction
For my InvoiceMaker application, I used Puppeteer to generate a PDF. I quite enjoyed the simplicity of the tool, and to showcase some of the many things you can do with Puppeteer, I thought I would make a little application.
What can Puppeteer do?
According to the GitHub README, Puppeteer can help you with quite a few things. At its core, it is a tool that makes it easier to scrape the web: it drives a headless instance of the Chrome browser (so, the Chrome browser without the UI). Web scraping means visiting a website programmatically and extracting data from it.
What we'll build
So, we will be building a little command line application. In this article, we will make sure it can do two things for now:
- Given a username, generate a screenshot of that user's personal page.
- Given a username, retrieve the last article that user wrote and generate it as a PDF.
Setup
So, let's create a folder called cli-scraping. Inside it, run yarn init (or npm init, but I'll be using yarn here). Accept the defaults and create an index.js file. Then, run yarn add puppeteer. Finally, create two folders inside cli-scraping: screenshots-users and pdfs. Let's get coding.
Getting command line arguments
We will use process.argv to get the arguments we provide. It will return an array, with at least two elements. Let's try it:
console.log(process.argv)
When I run node index.js, I get this in my console:
[ '/usr/local/Cellar/node/11.4.0/bin/node',
'/Users/Damien/Desktop/javascript/scraping/index.js' ]
You will get a different result, but you will always get at least those two elements. The first one is the path to the runtime used (here, node 11.4.0), the second is the script's path. So, every argument we give will start at process.argv[2]. If I run node index.js blabla, process.argv[2] will be blabla. Ok? Nice and easy. So, now we know how we'll retrieve arguments. Let's move on to Puppeteer.
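Before we do, just to make things concrete, here is a minimal sketch of how the arguments we pass will map to what the code uses later on (the variable names are just for illustration; the actual code reads process.argv directly):

// Running: node index.js getScreen damcosset
// process.argv => [ '/path/to/node', '/path/to/index.js', 'getScreen', 'damcosset' ]
const command = process.argv[2] // 'getScreen' -> which function to run
const username = process.argv[3] // 'damcosset' -> which user to target
console.log(command, username)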
Generate a screenshot
To generate a screenshot, we would use the following code:
// Don't forget to require Puppeteer at the top of the file
const puppeteer = require('puppeteer');

(async () => {
  // Launch an instance of a headless Chrome browser
  const browser = await puppeteer.launch()
  // Create a new page
  const page = await browser.newPage()
  // Move to the specified url
  await page.goto('urlToThePage')
  // Take a screenshot and save it at the specified path
  await page.screenshot({ path: 'screenshot.png' })
  // Close the browser
  await browser.close()
})()
Ok, so what do we need to do?
- Create a function to wrap this functionality.
- Call that function from the command line.
- Give the function the proper data (page URL, username).
I will use the following convention for my application: the first argument will be the name of the function, the second will be the username. So, the code could be:
const puppeteer = require('puppeteer')

const getScreenshot = async username => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(`https://dev.to/${username}`)
  await page.screenshot({ path: `screenshots-users/${username}.png`, fullPage: true })
  await browser.close()
}

switch (process.argv[2]) {
  case 'getScreen':
    getScreenshot(process.argv[3])
    break
  default:
    console.log('Wrong argument!')
}
First, we import puppeteer. Then, we create our getScreenshot function that will take care of generating our screenshots. We saw the skeleton of this function earlier; notice a few changes:
- page.goto takes the proper dev.to URL with the username provided.
- page.screenshot puts the PNG file in the screenshots-users folder, with the username as the file name. Notice the fullPage: true option to capture the full page.
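As an aside, page.screenshot accepts a few other options besides path and fullPage. For instance, if I remember the API correctly, the call inside getScreenshot could be tweaked to output a JPEG instead (purely as an illustration, this is not part of our application):

// Hypothetical variation: save a compressed JPEG instead of a PNG
await page.screenshot({
  path: `screenshots-users/${username}.jpg`,
  type: 'jpeg',
  quality: 80, // JPEG quality, from 0 to 100
  fullPage: true,
})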
Finally, we have a switch statement. I used getScreen as the argument name to generate screenshots.
Great, now I can run node index.js getScreen damcosset to get a screenshot of my profile. And I can see the screenshot, called damcosset.png, in the screenshots-users folder:
Note: I am cropping the screenshots here to save space, but the entire page is captured in the file ;)
Let's now run node index.js getScreen ben, and we'll get the following screenshot, called ben.png, in the same folder:
Generating a PDF
For this, we have four different steps:
1- Go to the user's personal page
2- Click on the last article she wrote to navigate there
3- Retrieve an attribute to make sure our PDF name is unique (optional, I guess)
4- Generate the PDF
Let's create a function called getPDF. The code inside would look like this:
const getPDF = async username => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  // Go to the user's personal page
  await page.goto(`https://dev.to/${username}`)
  // Click on the latest article and wait for the navigation to finish
  await Promise.all([page.waitForNavigation(), page.click('.single-article')])
  // Retrieve the article's data-path attribute to build a unique file name
  const dataPath = await page.evaluate(() =>
    document.querySelector('.article').getAttribute('data-path')
  )
  // Generate the PDF in the pdfs folder
  await page.pdf({ path: `pdfs/${dataPath.split('/')[2]}.pdf` })
  await browser.close()
}
The first 3 lines are always the same: launch the browser, create a new page, go to the URL... Then, we have a Promise.all. We are waiting for two actions here:
- A click on an article card.
- Then, the page where that article lives needs to load.
We need to explore the HTML content of the page here. In the dev tools, I can see that each article on a user's personal page has a class called single-article. So, this is what we'll target. For that, we'll use the page.click function and give it that selector.
This will target the first element with that selector, and because dev.to shows your newest article first, this is exactly what I was looking for.
Next, when I studied the HTML structure, I saw that each article is contained in a div with the article class. This element has a data-path attribute. By using page.evaluate, I can get that node and then retrieve this attribute. This ensures there will be no naming conflicts when saving our PDFs.
Finally, I'll call page.pdf and give it a path in the options. The data-path I retrieved looks something like /username/title-article-000, so I just split it to get the last part.
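To make the split explicit, here is what it does on an example value (using my own article as the illustration):

// Example data-path value, just to show the split
const dataPath = '/damcosset/start-before-you-are-ready-393e'
dataPath.split('/') // [ '', 'damcosset', 'start-before-you-are-ready-393e' ]
dataPath.split('/')[2] // 'start-before-you-are-ready-393e' -> our PDF file name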
And don't forget to add a case to our switch statement:
switch (process.argv[2]) {
  case 'getScreen':
    getScreenshot(process.argv[3])
    break
  case 'getPDF':
    getPDF(process.argv[3])
    break
  default:
    console.log('Wrong argument!')
}
Done! Now, we can run the following commands:
node index.js getPDF damcosset
node index.js getPDF ben
node index.js getPDF jess
So, this will launch a headless Chrome browser, travel to my page, click on the last article I wrote, travel to that page, and create a PDF with that page's content. Same thing for ben, same thing for jess.
So, now I have 3 PDFs in my pdfs folder, called:
start-before-you-are-ready-393e.pdf (Mine)
what-advice-would-you-give-someone-looking-to-use-their-keyboard-more-and-their-mouse-less-1lea.pdf (Ben)
what-was-your-win-this-week-3a9k.pdf (Jess)
Tadaaaaaaaa!
The code can be found here.
Conclusion
Ok, so this is it for the first part. Puppeteer is such a fun tool to play with, I'll make sure to come back to show you more of the amazing things we can do with it.
Have fun <3
Top comments (4)
Nice article Damien. Loves me some Puppeteer! Check out the thing I did a while back, I'd love some feedback if you are so inclined. hub.docker.com/r/davidjeddy/docker...
Shouldn't the promises from Promise.all() run sequentially (waitForNavigation, then click)?

Ok, so this is a part that I still have trouble getting my head around. There are some good answers on this issue, I believe. waitForNavigation will only resolve once the navigation has started and completed, so it will resolve after the page.click has completed. If I understood it correctly.
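To put it another way, here is the same line from the article with my understanding annotated in comments:

await Promise.all([
  // Start listening for the navigation first...
  page.waitForNavigation(),
  // ...then trigger it by clicking the article card.
  // Both promises are created right away; Promise.all simply resolves
  // once both of them have resolved.
  page.click('.single-article'),
])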
I'd like to try this on pages with SVG.