Antoine Mesnil

Posted on Nov 4, 2022 • Edited on Nov 8, 2022 • Originally published at antoinemesnil.com

Scrape images from a search engine with JavaScript and Puppeteer

#node #tutorial #javascript #duckduckgo

Introduction

In the previous post of this series, we saw how to use Node.js and Puppeteer to scrape and search content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.

In this article, we will fetch full-resolution images from a search engine. Our goal this time is to get a picture of every dog breed.

Script to get the image links

You should have Node.js and Puppeteer installed with npm or yarn.
We will use the same methods as in the first part.
Our list of dog breeds is a simple JSON array of breed names that can be found here: dog breeds dataset

As for the search engine, we will scrape DuckDuckGo because it makes it easy to get the images at full resolution, which can be trickier on Google Images.
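The script in this section builds the search URL by replacing spaces with + through replaceAll, which needs Node.js 15 or later and leaves other special characters (an ampersand, for instance) unescaped. As a minimal sketch, assuming the same query parameters as the script, the URL could instead be assembled with the built-in URLSearchParams. The helper name buildSearchUrl is hypothetical.

```javascript
// Hypothetical helper: builds the image-search URL for one query.
// The parameter values are copied from the URL used in the article's script.
const buildSearchUrl = (query) => {
  const params = new URLSearchParams({
    q: query,
    va: "b",
    t: "hc",
    iar: "images",
    iax: "images",
    ia: "images",
  })
  // URLSearchParams encodes spaces as "+" and escapes characters such as "&"
  return `https://duckduckgo.com/?${params.toString()}`
}

console.log(buildSearchUrl("German Shepherd"))
// https://duckduckgo.com/?q=German+Shepherd&va=b&t=hc&iar=images&iax=images&ia=images
```

Inside the loop, this would replace the template string with const url = buildSearchUrl(dogBreed).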

const puppeteer = require("puppeteer")
const data = require("./dog-breeds.json")

const script = async () => {
  //this opens a visible Chromium window, which is useful to see what is going on and to test things before finalizing the script
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()

  //loop on every breed
  for (let dogBreed of data) {
    console.log("Start for breed:", dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(
      " ",
      "+"
    )}&va=b&t=hc&iar=images&iax=images&ia=images`

    //in case we encounter a page without images or an error
    try {
      await page.goto(url)

      //wait until the page contains our targeted element
      //(page.goto already waits for the navigation to finish)
      await page.waitForSelector(".tile--img__media")

      await page.evaluate(() => {
        const firstImage = document.querySelector(".tile--img__media")
        //we open the panel that contains the image info
        firstImage.click()
      })

      //get the link of the image from the panel
      await page.waitForSelector(".detail__pane a")
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll(".detail__pane a")
        //"fichier" is French for "file": the link text depends on the language of the page
        const linkImage = Array.from(links).find((item) =>
          item.innerText.includes("fichier")
        )
        return linkImage?.getAttribute("href")
      })
      console.log("link successfully retrieved:", link)
      console.log("=====")
    } catch (e) {
      console.log(e)
    }
  }

  //close the browser once every breed has been processed
  await browser.close()
}

script()
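The script above finds the full-resolution link by looking for the text "fichier", the French word for "file", so it presumably depends on DuckDuckGo rendering its interface in French. Here is a minimal sketch of a more tolerant matcher. The names pickFileLink and FILE_LINK_LABELS, and the English label "view file", are assumptions for illustration and do not come from the original script.

```javascript
// Assumed link labels per language; only "fichier" comes from the article's script.
const FILE_LINK_LABELS = ["fichier", "view file"]

// Hypothetical helper: returns the href of the first link whose text contains
// one of the labels (case-insensitive), or undefined when none matches.
const pickFileLink = (links, labels = FILE_LINK_LABELS) => {
  const match = links.find(({ text }) =>
    labels.some((label) => text.toLowerCase().includes(label))
  )
  return match?.href
}

// Plain objects stand in for the anchors found with ".detail__pane a"
const sample = [
  { text: "Visit page", href: "https://example.com/page" },
  { text: "View file", href: "https://example.com/dog.jpg" },
]
console.log(pickFileLink(sample)) // https://example.com/dog.jpg
```

To use it in the script, page.evaluate would return the anchors as an array of { text, href } objects, and pickFileLink would then run on the Node.js side.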

After running the script with node scrapeImages.js, you should see the link retrieved for each breed logged in the console.

Download and optimize the images

We now have a link for every image, but some of the files are quite heavy (> 1 MB).
Fortunately, we can use another Node.js library to reduce their size with minimal loss of quality: sharp

It is a widely used library (2M+ weekly downloads) to convert, resize and optimize images.

You can add this inside the loop, right after the link is retrieved, to get a folder filled with the optimized images:

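//at the top of the script
const fs = require("fs")
const https = require("https")
const sharp = require("sharp")

//https.get takes a callback and cannot be awaited directly, so we wrap the download in a promise
const download = (link, path) =>
  new Promise((resolve, reject) => {
    https
      .get(link, (response) => {
        const stream = fs.createWriteStream(path)
        response.pipe(stream)
        stream.on("finish", () => stream.close(resolve))
        stream.on("error", reject)
      })
      .on("error", reject)
  })

//inside the loop, once the link has been retrieved
await download(link, `./${dogBreed}.jpg`)
console.log("Download Completed")

//resize to a maximum width or height of 1000px, keeping the aspect ratio
await sharp(`./${dogBreed}.jpg`)
  .resize(1000, 1000, { fit: "inside", withoutEnlargement: true })
  .toFile(`./${dogBreed}-small.jpg`)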
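The comment in the snippet above asks for a maximum width or height of 1000px. In sharp, that corresponds to the inside fit mode combined with withoutEnlargement, whereas the default fit mode, cover, crops the image to exactly the requested size. The sketch below only illustrates the arithmetic behind that bound. The helper fitInside is hypothetical, and sharp's own rounding may differ slightly.

```javascript
// Hypothetical helper: scales a size down so that neither side exceeds `max`,
// keeping the aspect ratio. Images that are already small enough stay untouched.
const fitInside = (width, height, max = 1000) => {
  const scale = Math.min(1, max / Math.max(width, height))
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  }
}

console.log(fitInside(4000, 3000)) // { width: 1000, height: 750 }
console.log(fitInside(800, 600)) // { width: 800, height: 600 }
```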

Conclusion

You can adapt this script to get pretty much anything. You are also not limited to the first image of each query: you can collect every image. As for myself, I used this script to get the initial images for a tool I'm working on: https://dreamclimate.city

😄 Thanks for reading! If you found this article useful, consider following me on Twitter, where I share tips on development and design, along with my journey to create my own startup studio.
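As a rough sketch of collecting more than the first result, the helper below gathers several image URLs from the results page in one call to Puppeteer's page.$$eval. The helper name collectThumbnails and the assumption that each .tile--img__media element contains an img tag are mine, not the article's. These are thumbnail URLs: getting the full-resolution file would still require opening each panel as the script does.

```javascript
// Hypothetical helper: returns up to `limit` thumbnail URLs from the results page.
// `page` is a Puppeteer Page. The selector assumes an <img> nested inside each
// ".tile--img__media" element, which is not confirmed by the article.
const collectThumbnails = (page, limit = 10) =>
  page.$$eval(
    ".tile--img__media img",
    // this function runs in the browser; `max` receives the value of `limit`
    (images, max) => images.slice(0, max).map((img) => img.src),
    limit
  )
```

Inside the loop, const thumbnails = await collectThumbnails(page, 5) would return an array of at most five URLs.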
