In this tutorial, we are going to build a webpage image downloader. Suppose you visit a webpage, notice that its images are cool, and want your own copies without saving them one by one; this simple tool we will build is going to be a lifesaver for you. This little project is also a good way to practice and hone your web scraping skills.
We will create a new directory called image-downloader and navigate into it. Pop open your terminal window and type in the following commands.
mkdir image-downloader && cd image-downloader
I will assume that you have Node.js and npm installed on your machine. We will then initialize this directory with a standard package.json file by running npm init -y, and then install two dependencies, namely puppeteer and node-fetch. Run the following command to get them installed.
npm install --save puppeteer node-fetch --verbose
You probably just noticed a new npm flag, --verbose. When installing puppeteer, npm also downloads a Chromium browser behind the scenes because it is a dependency of puppeteer. This download is usually large, and we are using the --verbose flag to see the progress of the installation. Nothing fancy, but let's use it because we can.
One more thing to do before getting our hands dirty with code is to create a directory where we want all our images to be downloaded. Let's name that directory images. We will also create an index.js file where all the app's logic will go.
mkdir images && touch index.js
It's great to clearly outline our thought process before writing a single line of code:
- Get all image tags from the page and extract the src property from each of them
- Make a request to each src link and store the response in the images directory (saving images to disk)
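The plan above boils down to two functions wired together. Here is a minimal sketch of that shape, using stub implementations in place of the real scraper and downloader we will build below (the sample URL and filename are made up):

```javascript
// High-level shape of the tool:
//   step 1: extractImageLinks() resolves to [{ src, filename }, ...]
//   step 2: saveImageToDisk(src, path) streams one image to disk
async function run(extractImageLinks, saveImageToDisk) {
  const images = await extractImageLinks();
  for (const { src, filename } of images) {
    await saveImageToDisk(src, `./images/${filename}`);
  }
  return images.length;
}

// Example with stub implementations:
run(
  async () => [{ src: 'https://example.com/a.jpg', filename: 'a.jpg' }],
  async (src, path) => { /* a real version would stream src into path */ }
).then((count) => console.log(`downloaded ${count} image(s)`));
```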
Step 1: Getting all image tags and their src property
'use strict';
const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Extract all imageLinks from the page
async function extractImageLinks(){
    const browser = await puppeteer.launch({
        headless: false
    })
    const page = await browser.newPage()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageBank = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))
            let imageArray = []
            imgTags.map((image) => {
                let src = image.src
                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]
                imageArray.push({
                    src,
                    filename
                })
            })
            return imageArray
        })

        await browser.close()
        return imageBank
    } catch (err) {
        console.log(err)
    }
}
Now let me explain what is happening here. First, we created an async function called extractImageLinks. In that function, we created an instance of a browser page using puppeteer and stored it in the page constant. Think of this page as the new tab you get after launching your Chrome browser; we can now control it from our code. We then get the URL of the page we want to download the images from via the command line and store it in a variable named baseURL. We navigate to that URL using the page.goto() function. The {waitUntil: 'networkidle0'} object passed as the second argument to this function ensures that we wait for network requests to settle before we proceed with parsing the page. page.waitForSelector('body') tells puppeteer to wait for the HTML body tag to render before we start extracting anything from the page.
The page.evaluate() function allows us to run JavaScript code in that page instance as if we were in the Chrome DevTools console. To get all image tags from the page, we call the document.querySelectorAll('img') function. However, this function returns a NodeList, not an array, so we wrap it with the Array.from() method to convert it. Now we have an array to work with.
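To see why the conversion matters: a NodeList is "array-like" (it has a length and indexed items) but lacks array methods such as map() and filter(). Array.from() turns any array-like or iterable object into a real array. A quick illustration with a hand-built array-like object:

```javascript
// An array-like object: indexed keys plus a length, but no array methods.
const arrayLike = { 0: 'a', 1: 'b', 2: 'c', length: 3 };

// Array.from() converts it into a real array.
const realArray = Array.from(arrayLike);

console.log(Array.isArray(arrayLike)); // false
console.log(Array.isArray(realArray)); // true
console.log(realArray.map((ch) => ch.toUpperCase())); // [ 'A', 'B', 'C' ]
```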
We then store all the image tags in the imgTags variable and initialize the imageArray variable as a placeholder for all the src values. Since imgTags is now a real array, we map over every tag in it and extract the src property from each image tag.
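As an aside, map() is really meant to return a transformed array; pushing into a separate array from inside the callback works, but returning values directly is more idiomatic. A small sketch of the same extraction written that way (the sample srcs here are made up):

```javascript
// map() builds a new array from the callback's return values,
// so no placeholder array or push() is needed.
const imgSrcs = [
  'https://example.com/photos/one.jpg',
  'https://example.com/photos/two.png'
];

const imageArray = imgSrcs.map((src) => {
  const parts = src.split('/');
  return { src, filename: parts[parts.length - 1] };
});

console.log(imageArray);
```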
Now for a little trick: we want to download each image while maintaining the original filename as it appears on the webpage. For instance, given the image src https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg, we want to get green-leaf_BVKZ4QW8LS.jpg from that URL. One way to do this is to split the string using the "/" delimiter. We then end up with something like this:
let src = `https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg`.split("/")
// Output
["https:", "", "cdn.stocksnap.io", "img-thumbs", "960w", "green-leaf_BVKZ4QW8LS.jpg"]
Now the last index of the array produced by running the split method on the image source contains the image's name along with its extension, awesome!
Note: to get the last item from any array, we subtract 1 from the length of that array like so:
let arr = [40,61,12]
let lastItemIndex = arr.length - 1 // This is the index of the last item
console.log(lastItemIndex)
// Output
2
console.log(arr[lastItemIndex])
// Output
12
So we store the index of the last item in the pos variable and the name of the file in the filename variable. Now that we have the source and the filename of the current image in the loop, we push these values as an object into the imageArray variable. After the mapping is done, we return imageArray, which by now has been fully populated. The function finally returns the imageBank variable, which contains the image links (sources) and the filenames.
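The split-and-index dance can also be written more compactly: split() returns an array, and Array.prototype.pop() returns its last element, which is exactly the filename segment of the URL.

```javascript
// pop() removes and returns the last element of an array,
// so split('/').pop() yields the filename segment of a URL in one step.
const src = 'https://cdn.stocksnap.io/img-thumbs/960w/green-leaf_BVKZ4QW8LS.jpg';
const filename = src.split('/').pop();

console.log(filename); // green-leaf_BVKZ4QW8LS.jpg
```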
Step 2: Saving images to disk
function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}

// Run the script on auto-pilot
(async function(){
    let imageLinks = await extractImageLinks()
    console.log(imageLinks)

    imageLinks.map((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })
})()
Now let's decipher this little piece. In the anonymous IIFE, we run extractImageLinks to get the array of objects containing src and filename. Since the function returns an array, we run the map function on that array and pass the required parameters (url and filename) to saveImageToDisk. We then use node-fetch to make a GET request to each url, and as the response comes down the wire, we concurrently pipe it into the filename destination, in this case a writable stream on our filesystem. This is very efficient because we are not waiting for the image to be fully loaded into memory before saving it to disk; instead we save every chunk we get from the response directly.
Let's run the code, cross our fingers, and check out our images directory.
node index.js https://stocksnap.io
We should see some cool images in there. Wooo! You can add this to your portfolio. There are many improvements that can be made to this little tool, such as allowing the user to specify the directory the images are downloaded into, handling Data URI images, proper error handling, code refactoring, and creating a standalone CLI utility for it (hint: the commander npm package is useful for that). You can go ahead and extend this app, and I'll be glad to see what improvements you make to it.
Full code
'use strict';
const puppeteer = require('puppeteer');
const fetch = require('node-fetch');
const fs = require('fs')

// Browser and page instance
async function instance(){
    const browser = await puppeteer.launch({
        headless: false
    })
    const page = await browser.newPage()
    return {page, browser}
}

// Extract all imageLinks from the page
async function extractImageLinks(){
    const {page, browser} = await instance()

    // Get the page url from the user
    let baseURL = process.argv[2] ? process.argv[2] : "https://stocksnap.io"

    try {
        await page.goto(baseURL, {waitUntil: 'networkidle0'})
        await page.waitForSelector('body')

        let imageLinks = await page.evaluate(() => {
            let imgTags = Array.from(document.querySelectorAll('img'))
            let imageArray = []
            imgTags.map((image) => {
                let src = image.src
                let srcArray = src.split('/')
                let pos = srcArray.length - 1
                let filename = srcArray[pos]
                imageArray.push({
                    src,
                    filename
                })
            })
            return imageArray
        })

        await browser.close()
        return imageLinks
    } catch (err) {
        console.log(err)
    }
}

(async function(){
    console.log("Downloading images...")
    let imageLinks = await extractImageLinks()

    imageLinks.map((image) => {
        let filename = `./images/${image.filename}`
        saveImageToDisk(image.src, filename)
    })
    console.log("Download complete, check the images folder")
})()

function saveImageToDisk(url, filename){
    fetch(url)
    .then(res => {
        const dest = fs.createWriteStream(filename);
        res.body.pipe(dest)
    })
    .catch((err) => {
        console.log(err)
    })
}
Cheers! 🎉