Mikhail Zub for SerpApi

Posted on Sep 22, 2022

Web scraping Google Reverse Images results with Nodejs

#webscraping #node #serpapi #google

How reverse search happens

First, we paste the image link into Google Image search:

Next, we click "Find image source":

What will be scraped

Full code

If you don't need an explanation, have a look at the full code example in the online IDE.

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search

const URL = `https://images.google.com`;

async function setImage(page) {
  const isPopup = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("iframe")).find((el) => el.style.visibility !== "hidden");
  });
  if (isPopup) {
    for (let i = 0; i < 14; i++) {
      await page.keyboard.press("Tab");
      await page.waitForTimeout(500);
    }
    await page.keyboard.press("Enter");
  }
  await page.waitForTimeout(1500);
  await page.click(".nDcEnd");
  await page.waitForTimeout(1500);
  await page.click(".PXT6cd input");
  await page.keyboard.type(imageUrl);
  await page.waitForTimeout(1500);
  await page.click(".PXT6cd div");
  await page.waitForTimeout(5000);
  await page.click(".QeWRZ .WpHeLc");
}

async function fillInfoFromPage(page) {
  return await page.evaluate(async () => {
    return Array.from(document.querySelectorAll("#search .Ww4FFb")).map((el) => ({
      title: el.querySelector(".yuRUbf > a > h3").textContent.trim(),
      link: el.querySelector(".yuRUbf > a").getAttribute("href"),
      snippet: el.querySelector(".VwiC3b").textContent.trim(),
    }));
  });
}

async function getReverseImageInfo() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);
  await page.waitForSelector(".nDcEnd");

  await setImage(page);

  await page.waitForTimeout(5000);
  const pages = await browser.pages();
  const page2 = pages[pages.length - 1];

  let imageOrganicResults = [];

  while (true) {
    await page2.waitForSelector(".Ww4FFb");
    imageOrganicResults.push(...(await fillInfoFromPage(page2)));
    const nextButton = await page2.$$(".d6cvqb");
    let isButtonActive;
    // page.$$() always resolves to an array (possibly empty), so check its length
    if (nextButton.length) {
      isButtonActive = await nextButton[1]?.$("a");
    } else {
      isButtonActive = await page2.$(".acRNod");
    }
    if (!isButtonActive) break;
    await isButtonActive.click();
  }

  await browser.close();

  return imageOrganicResults;
}

getReverseImageInfo().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth. Puppeteer controls Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth so that websites can't detect that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.

Process

The first step is to extract data from HTML elements, then move to the next page and repeat. Finding the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The GIF below illustrates the approach of selecting different parts of the results.

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, and define the link of the image we want to search for and the Google Images URL:

puppeteer.use(StealthPlugin());

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search

const URL = `https://images.google.com`;

Next, we write a function to paste the image URL into Google Images search:

async function setImage(page) {
  ...
}

In this function, we first need to check whether Google shows a popup offering to sign in or register. We use the evaluate() and querySelectorAll() methods to access the right HTML elements, make a new array from the resulting NodeList with Array.from(), and use the find() method to look for an iframe that isn't hidden:

const isPopup = await page.evaluate(() => {
  return Array.from(document.querySelectorAll("iframe")).find((el) => el.style.visibility !== "hidden");
});
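Values returned from evaluate() are serialized on their way back to Node.js, so a DOM element does not arrive as a usable element. As a minimal sketch (not part of the original code), the check could return a plain boolean instead. hasVisibleIframe is a hypothetical name, and the predicate is the same one used above:

```javascript
// Hypothetical predicate, not part of the original script.
// Returns true when at least one iframe in the list is not hidden.
function hasVisibleIframe(iframes) {
  return Array.from(iframes).some((el) => el.style.visibility !== "hidden");
}

// Possible usage inside the browser context (assumes a Puppeteer `page`):
// const isPopup = await page.evaluate(() =>
//   Array.from(document.querySelectorAll("iframe")).some((el) => el.style.visibility !== "hidden")
// );
```

Because some() always yields true or false, the if (isPopup) check no longer depends on how the element happens to be serialized.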

If such an iframe is found, we need to close the popup. Because the popup is placed in an iframe element, it is challenging to control it from Puppeteer. So we take the simple route: press the "Tab" key fourteen times (using the keyboard.press() method) with a 0.5 sec pause between presses (using the waitForTimeout() method) until the needed button is in focus, and then press the "Enter" key:

if (isPopup) {
  for (let i = 0; i < 14; i++) { // 14 is the number of "Tab" key presses
    await page.keyboard.press("Tab");
    await page.waitForTimeout(500);
  }
  await page.keyboard.press("Enter");
}

Then we click the necessary buttons (using the click() method) and type the imageUrl (using the keyboard.type() method). We pause with the waitForTimeout() method before each click, with a longer 5 sec pause before the last one:

await page.waitForTimeout(1500);
await page.click(".nDcEnd");
await page.waitForTimeout(1500);
await page.click(".PXT6cd input");
await page.keyboard.type(imageUrl);
await page.waitForTimeout(1500);
await page.click(".PXT6cd div");
await page.waitForTimeout(5000);
await page.click(".QeWRZ .WpHeLc");
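Fixed pauses are simple but brittle: too short on a slow connection, wasted time on a fast one. As a rough sketch (not the code used in this post), the same clicks could wait for each element to become visible instead. clickWhenVisible and setImageWithWaits are hypothetical names; the selectors are the ones from the code above, and the sketch assumes a Puppeteer page object is passed in:

```javascript
// Hypothetical helpers, not part of the original script.

// Wait until the element is visible, then click it.
async function clickWhenVisible(page, selector, timeout = 30000) {
  await page.waitForSelector(selector, { visible: true, timeout });
  await page.click(selector);
}

// The same sequence of steps as in setImage(), without fixed pauses.
async function setImageWithWaits(page, imageUrl) {
  await clickWhenVisible(page, ".nDcEnd"); // open the "search by image" dialog
  await clickWhenVisible(page, ".PXT6cd input"); // focus the URL input
  await page.keyboard.type(imageUrl);
  await clickWhenVisible(page, ".PXT6cd div"); // submit the image URL
  await clickWhenVisible(page, ".QeWRZ .WpHeLc"); // "Find image source"
}
```

If a selector never shows up, waitForSelector() rejects after the timeout, which makes a changed Google layout easier to notice than a click on a missing element.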

Next, we write a function to get the needed information from HTML elements. We use the textContent property and the trim() method to get the raw text and remove white space from both sides of the string. To get links, we use the getAttribute() method to read the "href" attribute of the HTML element:

async function fillInfoFromPage(page) {
  return await page.evaluate(async () => {
    return Array.from(document.querySelectorAll("#search .Ww4FFb")).map((el) => ({
      title: el.querySelector(".yuRUbf > a > h3").textContent.trim(),
      link: el.querySelector(".yuRUbf > a").getAttribute("href"),
      snippet: el.querySelector(".VwiC3b").textContent.trim(),
    }));
  });
}
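querySelector() returns null when nothing matches, so a result without a snippet would make the mapping above throw. A possible null-safe variant of the callback, as a sketch only: extractResult is a hypothetical name, and the selectors are the same as above.

```javascript
// Hypothetical variant of the mapping callback, not part of the original script.
// Returns null for a missing field instead of throwing.
function extractResult(el) {
  return {
    title: el.querySelector(".yuRUbf > a > h3")?.textContent.trim() ?? null,
    link: el.querySelector(".yuRUbf > a")?.getAttribute("href") ?? null,
    snippet: el.querySelector(".VwiC3b")?.textContent.trim() ?? null,
  };
}
```

Because evaluate() runs its callback in the browser, a function like this has to be declared inside that callback rather than in the Node.js scope.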

Next, we write a function to control the browser and get the information:

async function getReverseImageInfo() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that the browser runs in non-headless mode (with a visible window), and that it is launched with arguments that allow the browser process to start in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();
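If the same script has to run both on a desktop and on a server, the launch options could be built from the environment. This is only a sketch: buildLaunchOptions is a hypothetical helper, and HEADLESS is an assumed environment variable name that Puppeteer itself does not read.

```javascript
// Hypothetical helper, not part of the original script.
// Builds the options object that is passed to puppeteer.launch().
function buildLaunchOptions(env = process.env) {
  return {
    // false opens a visible browser window, true runs without one
    headless: env.HEADLESS === "true",
    // same flags as above, needed to launch the browser in the online IDE
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  };
}
```

It would then be used as puppeteer.launch(buildLaunchOptions()), keeping the visible window as the default.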

Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) with the .setDefaultNavigationTimeout() method to allow for a slow internet connection, go to URL with the .goto() method, and use the .waitForSelector() method to wait until the selector is loaded:

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".nDcEnd");

Then we wait until the setImage function has finished and switch the page context to the new tab (we get an array of all open pages with the browser.pages() method and pick the last one):

await setImage(page);

await page.waitForTimeout(5000);
const pages = await browser.pages();
const page2 = pages[pages.length - 1];
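Picking the last page after a fixed pause assumes that the new tab has already opened. A more defensive sketch compares the list of pages before and after the action and polls until a new one appears. waitForNewPage is a hypothetical helper, and it assumes browser.pages() returns the same Page objects on repeated calls:

```javascript
// Hypothetical helper, not part of the original script.
// Runs `action` and resolves with the first page (tab) that was not open
// before it, polling browser.pages() until the timeout expires.
async function waitForNewPage(browser, action, { timeout = 10000, interval = 250 } = {}) {
  const before = new Set(await browser.pages());
  await action();
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    const opened = (await browser.pages()).find((p) => !before.has(p));
    if (opened) return opened;
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
  throw new Error(`No new tab appeared within ${timeout} ms`);
}
```

With it, the lines above could become const page2 = await waitForNewPage(browser, () => setImage(page));.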

Then we create the empty imageOrganicResults array and use a while loop in which we wait for the results to load and add them to the end of the imageOrganicResults array (using the push() method and the spread syntax (...)).

After that, we need to go to the next page. If the next page button is present on the page, we click it and repeat the loop; otherwise, we end the loop:

let imageOrganicResults = [];

while (true) {
  await page2.waitForSelector(".Ww4FFb");
  imageOrganicResults.push(...(await fillInfoFromPage(page2)));
  const nextButton = await page2.$$(".d6cvqb");
  let isButtonActive;
  // page.$$() always resolves to an array (possibly empty), so check its length
  if (nextButton.length) {
    isButtonActive = await nextButton[1]?.$("a");
  } else {
    isButtonActive = await page2.$(".acRNod");
  }
  if (!isButtonActive) break;
  await isButtonActive.click();
}
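A while (true) loop keeps going for as long as a next button is found, which can mean a lot of requests for a popular image. One way to cap it, sketched with hypothetical names (collectPages, scrapePage, goToNextPage and maxPages are not part of the original code):

```javascript
// Hypothetical generic loop, not part of the original script.
// scrapePage() resolves to the results of the current page;
// goToNextPage() resolves to false when there is no next page.
async function collectPages({ scrapePage, goToNextPage, maxPages = 10 }) {
  const results = [];
  for (let pageNumber = 1; ; pageNumber++) {
    results.push(...(await scrapePage()));
    if (pageNumber >= maxPages) break; // stop at the cap without another click
    if (!(await goToNextPage())) break; // no next page left
  }
  return results;
}
```

Here scrapePage could wrap fillInfoFromPage(page2), and goToNextPage could wrap the button lookup and click from the loop above.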

And finally, we close the browser, and return the received data:

await browser.close();

return imageOrganicResults;

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "title":"Super Sport - Bugatti Veyron 16.4",
      "link":"https://www.bugatti.com/models/veyron-models/veyron-164-super-sport/",
      "snippet":"In the year of its market launch the Veyron 16.4 already set up a speed record for street cars. Adhering to the the Guiness World Record restrictions an ..."
   },
   {
      "title":"Bugatti Veyron - Wikipedia",
      "link":"https://en.wikipedia.org/wiki/Bugatti_Veyron",
      "snippet":"The Super Sport version of the Veyron is one of the fastest street-legal production cars in the world, with a top speed of 431.072 km/h (267.856 mph)."
   },
   ... and other results
]

Using Google Reverse Image API from SerpApi

This section compares the DIY solution with our solution.

The biggest difference is that you don't need to use browser automation to scrape results, create the parser from scratch, or maintain it.

There's also a chance that Google will block the request at some point. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA-solving and proxy providers to use.

First, we need to install google-search-results-nodejs:

npm i google-search-results-nodejs

Here's the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY);

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search

const params = {
  engine: "google_reverse_image", // search engine
  image_url: imageUrl, // search image
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const organicResults = [];
  while (true) {
    const json = await getJson();
    if (json.search_information?.organic_results_state === "Fully empty") break;
    organicResults.push(...json.image_results);
    params.start ? (params.start += 10) : (params.start = 10);
  }
  return organicResults;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key, read from the API_KEY environment variable

Next, we write an image URL and the necessary parameters for making a request:

const imageUrl = "https://www.bugatti.com/fileadmin/_processed_/sei/p63/se-image-ce40627babaa7b180bc3dedd4354d61c.jpg"; // what we want to search

const params = {
  engine: "google_reverse_image", // search engine
  image_url: imageUrl, // search image
};

Next, we wrap the search method from the SerpApi library in a promise so that we can work with the search results using async/await:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
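The wrapper above always resolves, even when the request fails. A possible variant that rejects on failure is sketched below. getJsonSafe is a hypothetical name, and the sketch assumes that a failed request reports the problem in an error field of the returned JSON:

```javascript
// Hypothetical variant of getJson(), not part of the original script.
// Takes the search client and params as arguments and rejects when the
// response carries an `error` field (assumed shape of a failed response).
const getJsonSafe = (search, params) =>
  new Promise((resolve, reject) => {
    search.json(params, (json) => {
      if (json && json.error) reject(new Error(json.error));
      else resolve(json);
    });
  });
```

Passing search and params in as arguments also makes the function easy to try out with a stub object instead of a real API call.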

And finally, we declare the function getResults that gets data from the page and returns it:

const getResults = async () => {
  ...
};

In this function, we first declare an organicResults array for the results data:

const organicResults = [];

Next, we use a while loop. In this loop we get the json with results, check whether results are present on the page (organic_results_state isn't "Fully empty"), push the results to the organicResults array, increase the start offset of the results page by 10, and repeat the loop until no results are present on the page:

while (true) {
  const json = await getJson();
  if (json.search_information?.organic_results_state === "Fully empty") break;
  organicResults.push(...json.image_results);
  params.start ? (params.start += 10) : (params.start = 10);
}
return organicResults;
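The two steps inside the loop, reading the results and moving the start offset, can also be written as small pure functions. This is only a sketch: nextPageParams and pageResults are hypothetical names, not part of the google-search-results-nodejs library.

```javascript
// Hypothetical helpers, not part of the google-search-results-nodejs library.

// Returns a copy of params moved to the next page of results;
// pageSize = 10 mirrors the step used in the loop above.
function nextPageParams(params, pageSize = 10) {
  return { ...params, start: (params.start ?? 0) + pageSize };
}

// Returns the results of one response, or an empty array when the
// image_results field is missing, so spreading it never throws.
function pageResults(json) {
  return Array.isArray(json?.image_results) ? json.image_results : [];
}
```

For example, nextPageParams({ engine: "google_reverse_image" }) gives start: 10, and calling it again on that result gives start: 20, while the original params object stays unchanged.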

After that, we run the getResults function and print all the received information in the console with the console.dir method, which accepts an options object that changes the default output options:

getResults().then((result) => console.dir(result, { depth: null }));

Output

[
   {
      "position":1,
      "title":"Best Bugatti Cars in India - CARS24",
      "link":"https://www.cars24.com/blog/best-bugatti-cars-in-india/",
      "displayed_link":"https://www.cars24.com › blog › best-bugatti-cars-in-in...",
      "thumbnail":"https://serpapi.com/searches/6319fad65c560673de2b144a/images/7c3a215cbf2776de47a9c447d0b97c5290a72394aec05f099004cb62a9250eee.jpeg",
      "image_resolution":"1920 × 1080",
      "snippet":"The Bugatti Veyron was originally launched in 2005 and was then, the fastest car ... Although the Divo is a super luxurious vehicle with a hefty price tag, ...",
      "snippet_highlighted_words":[
         "Bugatti Veyron",
         "super"
      ],
      "cached_page_link":"https://webcache.googleusercontent.com/search?q=cache:0BQ-hPQGl9IJ:https://www.cars24.com/blog/best-bugatti-cars-in-india/&cd=91&hl=en&ct=clnk&gl=us"
   },
   {
      "position":2,
      "title":"1056304 car, vehicle, road, Super Car, sports car, motion blur ...",
      "link":"https://rare-gallery.com/1056304-car-vehicle-road-super-car-sports-car-motion-blur-bugatti-bugatti-chiron-bugatti-veyron-performance-.html",
      "displayed_link":"https://rare-gallery.com › Another wallpapers",
      "thumbnail":"https://serpapi.com/searches/6319fad65c560673de2b144a/images/7c3a215cbf2776de9f26993a213d2a5cf9506bec4885508e13e62c8036abbf11.jpeg",
      "image_resolution":"1920 × 1080",
      "snippet":"Wallpaper name: car, vehicle, road, Super Car, sports car, motion blur, Bugatti, Bugatti Chiron, Bugatti Veyron, performance car, wheel, supercar, ...",
      "snippet_highlighted_words":[
         "Super",
         "Bugatti",
         "Bugatti",
         "Bugatti Veyron"
      ],
      "cached_page_link":"https://webcache.googleusercontent.com/search?q=cache:lcVswIAM3eMJ:https://rare-gallery.com/1056304-car-vehicle-road-super-car-sports-car-motion-blur-bugatti-bugatti-chiron-bugatti-veyron-performance-.html&cd=92&hl=en&ct=clnk&gl=us"
   },
   ... and other results
]

If you want to see some projects made with SerpApi, write me a message.
